wenxuanguan commented on issue #25618: [SPARK-28908][SS]Implement Kafka EOS sink for Structured Streaming
URL: https://github.com/apache/spark/pull/25618#issuecomment-526481232

> Spark doesn't have semantics of 2PC natively as you've seen DSv2 API - If I understand correctly, Spark HDFS sink doesn't leverage 2PC.
>
> Previously it used temporal directory - let all tasks write to that directory, and driver move that directory to final destination only when all tasks succeed to write. It leverages the fact that "rename" is atomic, so it didn't support "exactly-once" if underlying filesystem doesn't support atomic renaming.
>
> Now it leverages metadata - let all tasks write files, and pass the list of files (path) written to driver. When driver receives all list of written files from all tasks, driver writes overall list of files to metadata. So exactly-once for HDFS is only guaranteed when "Spark" reads the output which is aware of metadata information.

Sorry for the late reply. In my understanding, that is the procedure of 2PC. In the voting phase, every task writes its data and returns a commit message to the driver. In the commit phase, once all tasks have completed successfully, the driver commits the job with a rename, or aborts the job if any task failed to commit or the job commit itself failed.
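
To make the comparison concrete, here is a minimal Python sketch of the flow described above. It is not Spark's actual commit protocol code: `CommitMessage`, `run_task`, and `run_job` are hypothetical names made up for illustration, and a local directory rename stands in for the HDFS rename.

```python
import os
import shutil
import tempfile
from dataclasses import dataclass


@dataclass
class CommitMessage:
    """What a task hands back to the driver as its 'yes' vote (hypothetical)."""
    task_id: int
    file_name: str


def run_task(task_id, rows, staging_dir, fail=False):
    """Voting phase: write this task's data into the staging directory."""
    if fail:
        # Simulated task failure; the driver treats this as a 'no' vote.
        raise RuntimeError(f"task {task_id} failed")
    file_name = f"part-{task_id:05d}"
    with open(os.path.join(staging_dir, file_name), "w") as f:
        f.writelines(f"{row}\n" for row in rows)
    return CommitMessage(task_id, file_name)


def run_job(partitions, final_dir, failing_tasks=()):
    """Driver side: collect votes, then commit with a rename or abort."""
    parent = os.path.dirname(os.path.abspath(final_dir))
    staging_dir = tempfile.mkdtemp(prefix="_temporary-", dir=parent)
    try:
        # Voting phase: every task must return a commit message.
        messages = [
            run_task(i, rows, staging_dir, fail=(i in failing_tasks))
            for i, rows in enumerate(partitions)
        ]
        # Commit phase: a single rename publishes all task output at once.
        os.rename(staging_dir, final_dir)
        return messages
    except Exception:
        # Abort: a task failed or the job commit (rename) itself failed.
        shutil.rmtree(staging_dir, ignore_errors=True)
        return None
```

In this sketch, `run_job` returns the list of commit messages on success and `None` after an abort, in which case nothing is left under the final path. As the quoted comment points out, the guarantee is only as strong as the atomicity of the rename on the underlying filesystem.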