[ https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15560661#comment-15560661 ]
Ofir Manor commented on SPARK-17815:
------------------------------------

I hope others with a better understanding of Spark internals can comment, but still:

1. "The kafka commit log cant be ignored as merely for metric collection either" - that is exactly how I read this specific ticket (title and description)...
The way I understand it, when a Structured Streaming job starts for the first time, as of current trunk, a new consumer group is generated, with its offsets set based on a Spark source option, not on Kafka defaults. After a failure, the consumer group's offsets in Kafka are ignored, on purpose, and are overridden (seek?) by the Structured Streaming infra based on its internal checkpoint. So, I don't get your comment about it. If, during recovery, the offsets are not set correctly or some corner case / exception is not handled, that is probably a bug in the Structured Streaming Kafka source that should be reported and fixed.

2. Regarding your WAL comment - I'm not sure you are accurate. The WAL should be written to HDFS at the beginning of a batch, and a checkpoint at the end of the batch. So I'm not sure which corruption scenario you are referring to. You are not referring to HDFS bugs, right? Is it just a potential vector for Spark-induced file corruption? I assume that generally, if the checkpoint or the WAL is corrupted, the job would fail, as it can't guarantee exactly-once.

> Report committed offsets
> ------------------------
>
>                 Key: SPARK-17815
>                 URL: https://issues.apache.org/jira/browse/SPARK-17815
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>
> Since we manage our own offsets, we have turned off auto-commit. However, this means that external tools are not able to report on how far behind a given streaming job is. When the user manually gives us a group.id, we should report back to it.
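
To make the start-up and recovery behaviour described in point 1 of the comment concrete, here is a minimal sketch in plain Python. It is a toy model of the comment's reading, not Spark's actual code: the function name `resolve_start_offsets`, its parameters, and the offset maps are all hypothetical, and no Spark or Kafka API is used.

```python
from typing import Dict, Optional

# Hypothetical representation: Kafka partition id -> next offset to read.
Offsets = Dict[int, int]


def resolve_start_offsets(
    checkpoint: Optional[Offsets],
    kafka_committed: Offsets,
    option_offsets: Offsets,
) -> Offsets:
    """Pick the offsets a toy streaming source would seek to on start-up.

    Mirrors the reading in the comment above:
    - on recovery, the internal checkpoint wins;
    - on a first run, the source option wins;
    - offsets committed to Kafka for the consumer group are never consulted.
    """
    # kafka_committed is accepted only to show that it is deliberately ignored.
    del kafka_committed

    if checkpoint is not None:
        # Recovery after failure: override whatever the group has in Kafka.
        return dict(checkpoint)

    # First start: no checkpoint yet, so use the configured starting offsets.
    return dict(option_offsets)


# First run: nothing checkpointed, start from the source option.
first_run = resolve_start_offsets(None, {0: 7}, {0: 0})

# Restart after failure: the checkpoint overrides the Kafka-side offsets.
recovered = resolve_start_offsets({0: 42}, {0: 7}, {0: 0})
```

Under this model, committing offsets back to Kafka (what the ticket asks for) would only serve external lag-reporting tools, since the value in `kafka_committed` never influences where the job resumes.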