Martin Andersson created SPARK-38445:
----------------------------------------

             Summary: Are hadoop committers used in Structured Streaming?
                 Key: SPARK-38445
                 URL: https://issues.apache.org/jira/browse/SPARK-38445
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 3.2.1
            Reporter: Martin Andersson


At my company we use Spark Structured Streaming to sink messages from Kafka 
to HDFS. We're in the late stages of migrating this component to sink messages 
to AWS S3 instead, and in the process we ran into a couple of issues regarding 
Hadoop committers.

I've come to understand that the default "file" committer (documented 
[here|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#Switching_to_an_S3A_Committer])
 is unsafe to use on S3, which is why the Spark documentation recommends 
using the "directory" (i.e. staging) committer, and later versions also 
recommend the "magic" committer.
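
For reference, here is a minimal sketch of the properties that, as far as I understand the Hadoop and Spark cloud-integration documentation, select an S3A committer for ordinary (batch) Spark jobs. The helper names `s3a_committer_conf` and `as_submit_args` are hypothetical and only for illustration. The two `org.apache.spark.internal.io.cloud` classes are assumed to be on the classpath via the spark-hadoop-cloud module. Whether any of this affects a Structured Streaming file sink is exactly the open question here.

```python
# Sketch only: builds the Spark properties commonly used to pick an S3A committer.
# Assumption: the spark-hadoop-cloud module (which provides the two
# org.apache.spark.internal.io.cloud classes below) is available to the job.

S3A_COMMITTERS = {"directory", "partitioned", "magic"}


def s3a_committer_conf(name="directory"):
    """Return Spark properties selecting the given S3A committer (hypothetical helper)."""
    if name not in S3A_COMMITTERS:
        raise ValueError(f"unknown S3A committer: {name!r}")
    conf = {
        # Hadoop option fs.s3a.committer.name, passed through Spark's "spark.hadoop." prefix.
        "spark.hadoop.fs.s3a.committer.name": name,
        # Route Spark SQL file writes through the Hadoop PathOutputCommitter machinery.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        # Parquet needs a ParquetOutputCommitter subclass; this one delegates to the S3A committer.
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }
    if name == "magic":
        # The magic committer additionally requires this switch to be on.
        conf["spark.hadoop.fs.s3a.committer.magic.enabled"] = "true"
    return conf


def as_submit_args(conf):
    """Flatten the properties into spark-submit style '--conf key=value' arguments."""
    return [arg for k, v in sorted(conf.items()) for arg in ("--conf", f"{k}={v}")]


if __name__ == "__main__":
    print(" ".join(as_submit_args(s3a_committer_conf("directory"))))
```

The same key/value pairs could instead be placed in spark-defaults.conf or set on the session builder. The sketch only assembles them and does not start Spark.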

However, it's not clear whether Spark Structured Streaming uses committers at 
all. There's no "_SUCCESS" file in the destination (unlike with normal Spark 
jobs), and documentation on committers used in streaming is non-existent.

Can anyone please shed some light on this?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

