HeartSaVioR edited a comment on issue #25618: [SPARK-28908][SS]Implement Kafka 
EOS sink for Structured Streaming
URL: https://github.com/apache/spark/pull/25618#issuecomment-526394910
 
 
   Just skimmed the design doc (I still need to look more deeply at fault 
tolerance), and it's basically the known approach that Flink uses (2PC). Please 
mention what inspired you, to give credit where it's due.
   
   I planned to propose something similar last year (I never proposed the 
design itself). More precisely, I asked for 2PC support at the DSv2 API level, 
since Spark doesn't support 2PC natively, but the feedback wasn't positive 
because it would be a very invasive change to the Spark codebase. There have 
been more cases asking for exactly-once writes, and I guess the common answer 
was to leverage intermediate output. While some storage can leverage it (e.g. 
RDBMS: writers write to a temp table, and the driver copies the rows that 
writers reported to the output table), it doesn't make sense for Kafka, at 
least for performance reasons, as there's no way to have Kafka copy its records 
from topic A to topic B (right?), so I gave up.
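
   To make the intermediate-output idea concrete, here is a minimal sketch of 
the RDBMS case, using an in-memory SQLite database as a stand-in. The table 
names (`staging`, `output`, `committed_batches`) and the function names are 
hypothetical, chosen only for illustration; this is not Spark's sink API.

```python
import sqlite3


def open_db():
    # In-memory database standing in for the RDBMS sink (assumption).
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE staging (batch_id INTEGER, partition_id INTEGER, value TEXT);
        CREATE TABLE output (batch_id INTEGER, value TEXT);
        CREATE TABLE committed_batches (batch_id INTEGER PRIMARY KEY);
        """
    )
    return conn


def write_partition(conn, batch_id, partition_id, rows):
    """Writer side: stage rows only; the output table is untouched."""
    with conn:
        # A retried task replaces whatever an earlier attempt staged.
        conn.execute(
            "DELETE FROM staging WHERE batch_id = ? AND partition_id = ?",
            (batch_id, partition_id),
        )
        conn.executemany(
            "INSERT INTO staging VALUES (?, ?, ?)",
            [(batch_id, partition_id, value) for value in rows],
        )
    return partition_id  # what the writer reports back to the driver


def commit_batch(conn, batch_id, reported_partitions):
    """Driver side: copy reported rows to the output table, once per batch."""
    with conn:
        done = conn.execute(
            "SELECT 1 FROM committed_batches WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if done:
            return False  # batch already committed; a rerun is a no-op
        for partition_id in reported_partitions:
            conn.execute(
                "INSERT INTO output SELECT batch_id, value FROM staging "
                "WHERE batch_id = ? AND partition_id = ?",
                (batch_id, partition_id),
            )
        conn.execute("INSERT INTO committed_batches VALUES (?)", (batch_id,))
        conn.execute("DELETE FROM staging WHERE batch_id = ?", (batch_id,))
    return True
```

   The copy and the batch marker land in one database transaction, which is 
what makes the driver-side commit atomic and safe to retry. Kafka has no 
equivalent of the server-side `INSERT ... SELECT` step, which is the 
performance concern described above.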
   
   If the code change implements 2PC correctly, I guess it would work in many 
cases in general, though, as the doc explains, a transaction timeout leads to 
data loss. I identified the transaction timeout issue when I worked on my own 
design, and it was one of my major concerns as well. Once a producer writes 
something, it must be committed within the timeout under any kind of failure; 
otherwise "data loss" happens. Even if we decide to invalidate that batch and 
rerun it, we're then at "at-least-once", as some partitions have already 
committed successfully. (I'm wondering whether Flink's Kafka producer with 2PC 
has a similar issue, or whether they have some safeguard.)
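
   The timeout concern can be illustrated with a toy simulation. `FakeBroker` 
below is a hypothetical stand-in with a logical clock, not the Kafka client 
API; it only assumes that the broker aborts a transaction that stays open 
longer than its timeout (in Kafka, the producer setting is 
`transaction.timeout.ms`).

```python
class FakeBroker:
    """Toy transactional log with a logical clock (illustration only)."""

    def __init__(self, txn_timeout):
        self.txn_timeout = txn_timeout
        self.now = 0         # logical time, advanced by the caller
        self.open_txns = {}  # txn_id -> (start_time, buffered records)
        self.log = []        # committed records, visible to readers

    def begin(self, txn_id):
        self.open_txns[txn_id] = (self.now, [])

    def send(self, txn_id, record):
        self.open_txns[txn_id][1].append(record)

    def commit(self, txn_id):
        started, records = self.open_txns.pop(txn_id)
        if self.now - started > self.txn_timeout:
            return False  # timed out: the broker aborted it, records are gone
        self.log.extend(records)
        return True


def run_batch(broker, batch, slow_partition=None, delay=0):
    # Phase 1: every partition writes its records in its own transaction.
    for partition_id, records in batch.items():
        broker.begin(partition_id)
        for record in records:
            broker.send(partition_id, record)
    # Phase 2: the driver commits the transactions one by one.
    results = {}
    for partition_id in batch:
        if partition_id == slow_partition:
            broker.now += delay  # e.g. a driver failure and slow recovery
        results[partition_id] = broker.commit(partition_id)
    return results


def demo():
    broker = FakeBroker(txn_timeout=10)
    batch = {0: ["a"], 1: ["b"]}
    # Partition 1 is committed too late, so its transaction is lost.
    first = run_batch(broker, batch, slow_partition=1, delay=20)
    after_first = list(broker.log)
    # Rerunning the whole batch recovers "b" but writes "a" a second time.
    second = run_batch(broker, batch)
    return first, after_first, second, list(broker.log)
```

   In `demo()`, the first attempt commits partition 0 and loses partition 1; 
rerunning the batch then duplicates partition 0's record, which is the 
at-least-once outcome described above.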

