wenxuanguan commented on issue #25618: [SPARK-28908][SS]Implement Kafka EOS 
sink for Structured Streaming
URL: https://github.com/apache/spark/pull/25618#issuecomment-526481232
 
 
   > Spark doesn't have semantics of 2PC natively as you've seen DSv2 API - If 
I understand correctly, Spark HDFS sink doesn't leverage 2PC.
   > 
   > Previously it used temporal directory - let all tasks write to that 
directory, and driver move that directory to final destination only when all 
tasks succeed to write. It leverages the fact that "rename" is atomic, so it 
didn't support "exactly-once" if underlying filesystem doesn't support atomic 
renaming.
   > 
   > Now it leverages metadata - let all tasks write files, and pass the list 
of files (path) written to driver. When driver receives all list of written 
files from all tasks, driver writes overall list of files to metadata. So 
exactly-once for HDFS is only guaranteed when "Spark" reads the output which is 
aware of metadata information.
   
   Sorry for the late reply.
   In my understanding, that is exactly the procedure of 2PC.
   In the voting phase, every task writes its data and returns a commit message to the driver. In the commit phase, once all tasks have completed successfully, the driver commits the job with a rename; it aborts the job if any task fails to commit or the job-level commit itself fails.
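
   The voting/commit flow described above can be sketched as follows. This is an illustrative sketch only, not Spark code: the `run_job`, `commit_job`, and `abort_job` names are hypothetical stand-ins for the driver-side logic, and each "task" is modeled as a callable returning its commit message (e.g. the list of files it wrote).

   ```python
   # Hypothetical sketch of the two-phase-commit-style protocol described
   # above. Names (run_job, commit_job, abort_job) are illustrative only
   # and do not correspond to actual Spark APIs.

   def commit_job(commit_messages):
       # Commit phase, success path: the driver makes all task outputs
       # visible at once, e.g. via an atomic directory rename or by
       # writing the full file list to a metadata log.
       print(f"committed {len(commit_messages)} task outputs")

   def abort_job(commit_messages):
       # Commit phase, failure path: discard any partial outputs so
       # readers never observe an incomplete job.
       print("job aborted, discarding partial outputs")

   def run_job(tasks):
       # Voting phase: every task writes its data and returns a commit
       # message (here, modeled as the task's return value).
       commit_messages = []
       for task in tasks:
           try:
               commit_messages.append(task())
           except Exception:
               # Any task failure vetoes the job.
               abort_job(commit_messages)
               raise
       # All votes are in and positive: the driver commits the job.
       commit_job(commit_messages)
       return commit_messages

   # Example: two tasks that each "write" one output file.
   run_job([lambda: "part-0000", lambda: "part-0001"])
   ```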
