HeartSaVioR edited a comment on issue #25618: [SPARK-28908][SS] Implement Kafka EOS sink for Structured Streaming
URL: https://github.com/apache/spark/pull/25618#issuecomment-531519195

> About a new Kafka API to resolve Kafka transaction in distributed system, as @HeartSaVioR mentioned above, Kafka producer transaction is not provided only for Kafka Stream, and a new API for Spark/Flink/Hive may be customized. So I also think we should adapt Spark/Flink/Hive to it.

Sorry, you are understanding my comment the opposite way. My claim was that the Kafka producer transaction is designed "for" Kafka Streams. Please read my comment again carefully.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics

According to the design doc, the Kafka community took the "transaction per task" approach:

> In this design we take the approach to assign a separate producer per task so that any transaction contains only output messages of a single task.

which means they never need to worry about a transaction spanning multiple connections/JVMs, unlike other streaming frameworks. From that, I guess Kafka Streams leverages Kafka topics as shuffle storage and runs the user application as multiple connected `read-process-write` topologies. (So ensuring exactly-once for each of the connected parts brings exactly-once for the overall graph.) That approach is completely coupled with Kafka, and Spark can't (and shouldn't) do the same.
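For context, here is a minimal sketch of the single-task `read-process-write` loop that KIP-129's design builds on, written against the plain Kafka clients API (2.5+, where `sendOffsetsToTransaction` accepts `ConsumerGroupMetadata`). The topic names, group id, and the `task-0` transactional.id are illustrative assumptions, not anything from this PR:

```scala
import java.time.Duration
import java.util.{Collections, HashMap, Properties}
import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.errors.ProducerFencedException
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

object ReadProcessWriteTask {
  def main(args: Array[String]): Unit = {
    val consumerProps = new Properties()
    consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group")          // assumed group id
    consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed") // skip aborted txns
    consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")

    val producerProps = new Properties()
    producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    // One fixed transactional.id per task: the broker uses it to fence a
    // zombie instance of the same task after failover.
    producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "task-0")       // assumed id

    val consumer = new KafkaConsumer(consumerProps, new StringDeserializer, new StringDeserializer)
    val producer = new KafkaProducer(producerProps, new StringSerializer, new StringSerializer)

    consumer.subscribe(Collections.singletonList("input"))
    producer.initTransactions()

    while (true) {
      val records = consumer.poll(Duration.ofMillis(100))
      if (!records.isEmpty) {
        producer.beginTransaction()
        try {
          val offsets = new HashMap[TopicPartition, OffsetAndMetadata]()
          records.asScala.foreach { r =>
            producer.send(new ProducerRecord("output", r.key, r.value.toUpperCase))
            offsets.put(new TopicPartition(r.topic, r.partition), new OffsetAndMetadata(r.offset + 1))
          }
          // The input offsets are committed inside the same transaction as the
          // output records, so the read and the write succeed or fail atomically.
          producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata())
          producer.commitTransaction()
        } catch {
          case _: ProducerFencedException => producer.close() // another instance took over
          case _: Exception               => producer.abortTransaction()
        }
      }
    }
  }
}
```

Note that every transaction here involves exactly one producer on one JVM; there is never a transaction shared across connections or executors, which is the property Spark's distributed sink cannot assume.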