[jira] [Commented] (SPARK-38445) Are hadoop committers used in Structured Streaming?

Steve Loughran (Jira) Tue, 05 Apr 2022 09:48:04 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-38445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517556#comment-17517556
 ]


Steve Loughran commented on SPARK-38445:
----------------------------------------

Not supported unless you provide the PR for a new committer.

Hadoop 3.3.1 added an abort() call on an output stream in order to make a 
zero-rename committer possible here. You would initiate a write to the final 
destination, but call abort() before close() if you needed to abort. As no 
output will appear if the process dies, failures won't be visible (though 
still billable, of course, if you don't purge the uploads).
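
A minimal sketch of that write-then-abort-or-close pattern, in plain Java. 
This is only an illustration of the idea: FakeObjectStore, AbortableStream and 
ZeroRenameSketch are hypothetical stand-ins written for this example, not 
Hadoop or Spark classes, and the in-memory map only models the fact that data 
becomes visible on close() and never on abort(). It does not model pending 
uploads that are left behind (and billed) when a process dies.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for an object store; not a Hadoop class. */
class FakeObjectStore {
    // Only writes that were completed with close() ever land here.
    final Map<String, byte[]> visible = new ConcurrentHashMap<>();

    AbortableStream create(String key) {
        return new AbortableStream(this, key);
    }
}

/** Hypothetical stream: data is published on close() and discarded on abort(). */
class AbortableStream extends OutputStream {
    private final FakeObjectStore store;
    private final String key;
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private boolean finished = false;

    AbortableStream(FakeObjectStore store, String key) {
        this.store = store;
        this.key = key;
    }

    @Override
    public void write(int b) throws IOException {
        if (finished) {
            throw new IOException("stream already closed or aborted");
        }
        pending.write(b);
    }

    /** Discards the pending data; a later close() becomes a no-op. */
    void abort() {
        finished = true;
        pending.reset();
    }

    /** Publishes the pending data under the final key, unless already finished. */
    @Override
    public void close() {
        if (finished) {
            return;
        }
        finished = true;
        store.visible.put(key, pending.toByteArray());
    }
}

class ZeroRenameSketch {
    /** Writes straight to the final key; calls abort() before close() on failure. */
    static boolean writeTask(FakeObjectStore store, String key, String data, boolean fail) {
        AbortableStream out = store.create(key);
        try {
            out.write(data.getBytes(StandardCharsets.UTF_8));
            if (fail) {
                throw new IOException("simulated task failure");
            }
            return true;
        } catch (IOException e) {
            out.abort(); // nothing will appear at the destination
            return false;
        } finally {
            out.close(); // publishes on success, no-op after abort()
        }
    }
}
```

With this model, a successful writeTask call leaves the key in store.visible, 
while a failed one leaves nothing at the destination, so no rename step is 
needed to hide incomplete output.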

> Are hadoop committers used in Structured Streaming?
> ---------------------------------------------------
>
>                 Key: SPARK-38445
>                 URL: https://issues.apache.org/jira/browse/SPARK-38445
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 3.2.1
>            Reporter: Martin Andersson
>            Priority: Major
>              Labels: structured-streaming
>
> At the company I work at we're using Spark Structured Streaming to sink 
> messages from Kafka to HDFS. We're in the late stages of migrating this 
> component to instead sink messages to AWS S3, and in connection with that we 
> hit upon a couple of issues regarding hadoop committers.
> I've come to understand that the default "file" committer (documented 
> [here|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#Switching_to_an_S3A_Committer])
>  is unsafe to use in S3, which is why [this page in the spark 
> documentation|https://spark.apache.org/docs/3.2.1/cloud-integration.html] 
> recommends using the "directory" (i.e. staging) committer, and later 
> versions of Hadoop also recommend the "magic" committer.
> However, it's not clear whether Spark Structured Streaming even uses 
> committers. There's no "_SUCCESS" file in the destination (unlike normal 
> Spark jobs), and the documentation regarding committers used in streaming is 
> non-existent.
> Can anyone please shed some light on this?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
