HeartSaVioR edited a comment on issue #25618: [SPARK-28908][SS]Implement Kafka EOS sink for Structured Streaming URL: https://github.com/apache/spark/pull/25618#issuecomment-526394910 Just skimmed the design doc (need to take a look deeply on fault tolerance) and it's basically known approach what Flink is doing (2PC). Please mention what you've inspired of, for same reason, credit. I planned to propose similar before (in last year, haven't proposed the design itself), more clearly I've asked to support 2PC in DSv2 API level as Spark doesn't support 2PC natively, but feedback wasn't positive as it should be very invasive change on Spark codebase. There has been more cases asking for exactly-once write, and I guess the common answer was leveraging intermediate output. While some storage can leverage it (e.g. RDBMS - writers write to temp table, driver copies rows that writers reported to output table), it doesn't make sense for Kafka, at least performance reason, as there's no way to let Kafka copies its records from topic A to topic B (right?), so I gave up. If the code change implements 2PC correctly, in general I guess it would work in many cases, though as it's explained that transaction timeout leads data loss. I've indicated the issue on transaction timeout when I designed it and that was also one of major concerns as well. When the producer writes something it must be committed within timeout in any kinds of failures, otherwise "data loss" happen. Even we decide to invalidate that batch and rerun the batch, we're now then "at-least-once" as some partitions already committed successfully. (I'm wondering Flink's Kafka producer with 2PC also has similar issue or they have some safeguard.)
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org