[ https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15559541#comment-15559541 ]
Ofir Manor commented on SPARK-17815: ------------------------------------ As far as I understand, there is a clear single-point-of-truth: the structured streaming "commit log" - the checkpoint. It holds both the source state (offsets) and the Spark state (aggregations) of successfully finished batches atomically, and is the one that is used during recovery to identify the correct beginning offset in the source during recovery. The structured WAL is a technical, internal implementation detail, that stores an intention to process a range of offsets, before they are actually read. Spark used it during recovery to repeat the same source end boundary to a failed batch. The data in the downstream store is about Spark output - which [version,spark partition] have landed - not about source state. Of course, it is being used during Spark recovery / retry, but not as a basis to choose a offsets in the source (it is used to skip specific output version-partitions there were already written). As this ticket states, updating the Kafka consumer group offsets in Kafka is only for easier progress monitoring using Kafka-specific tools. So, it should be considered informational, after-the-fact updating just for being nice, as it won't be used for Spark recovery. If a user want to manually recover, it should rely on the Spark checkpoint offset. In other words, updating Kafka offsets after a batch successfully commited means that the offsets in Kafka represent which messages have been successfully processed and landed in the sink, not which messages have been read. [~marmbrus] Is my understanding correct? > Report committed offsets > ------------------------ > > Key: SPARK-17815 > URL: https://issues.apache.org/jira/browse/SPARK-17815 > Project: Spark > Issue Type: Sub-task > Components: SQL > Reporter: Michael Armbrust > > Since we manage our own offsets, we have turned off auto-commit. However, > this means that external tools are not able to report on how far behind a > given streaming job is. When the user manually gives us a group.id, we > should report back to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org