[ 
https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15559541#comment-15559541
 ] 

Ofir Manor commented on SPARK-17815:
------------------------------------

As far as I understand, there is a clear single point of truth: the structured 
streaming "commit log" - the checkpoint. It atomically holds both the source 
state (offsets) and the Spark state (aggregations) of successfully finished 
batches, and it is the one used during recovery to identify the correct 
beginning offset in the source.
The structured streaming WAL is an internal implementation detail that stores 
an intention to process a range of offsets before they are actually read. 
Spark uses it during recovery to re-run a failed batch with the same source end 
boundary.
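
To make the roles of the two logs concrete, here is a toy Python model of the 
recovery decision described above. It is only a sketch, not Spark's actual 
implementation: the names `offset_wal`, `commit_log` and `plan_recovery` are 
hypothetical, and both logs are simplified to plain dicts keyed by batch id.

```python
from typing import Dict, Optional, Tuple


def plan_recovery(offset_wal: Dict[int, int],
                  commit_log: Dict[int, int],
                  initial_offset: int = 0) -> Tuple[int, int, Optional[int]]:
    """Return (batch_id, start_offset, end_offset) for the next batch to run.

    offset_wal[batch_id] = end offset the batch intended to process
                           (written before any data is read)
    commit_log[batch_id] = end offset of a batch that finished successfully
    end_offset is None when no intention was logged for the next batch.
    """
    last_committed = max(commit_log) if commit_log else None
    # The start offset always comes from the last successfully finished batch.
    start = commit_log[last_committed] if last_committed is not None else initial_offset
    next_batch = 0 if last_committed is None else last_committed + 1
    # If the failed batch logged an intention, reuse the same end boundary.
    end = offset_wal.get(next_batch)
    return next_batch, start, end


# Batches 0 and 1 finished; batch 2 was planned in the WAL but failed.
wal = {0: 100, 1: 250, 2: 400}
commits = {0: 100, 1: 250}
print(plan_recovery(wal, commits))  # prints (2, 250, 400)
```
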
The data in the downstream store is about Spark output - which [version, Spark 
partition] pairs have landed - not about source state. Of course, it is used 
during Spark recovery / retry, but not as a basis for choosing offsets in the 
source (it is used to skip specific output version-partitions that were 
already written).
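
As an illustration of that skip logic, the sketch below models a sink that 
remembers which (version, partition) pairs it has stored. The class 
`IdempotentSink` and its methods are made up for this example and do not 
correspond to a Spark API; a real sink would persist this bookkeeping in the 
downstream store itself.

```python
from typing import List, Set, Tuple


class IdempotentSink:
    """Toy sink that tracks which (version, partition) outputs have landed."""

    def __init__(self) -> None:
        self.written: Set[Tuple[int, int]] = set()
        self.rows: List[str] = []

    def write(self, version: int, partition: int, data: List[str]) -> bool:
        """Store data unless this (version, partition) already landed.

        Returns True if the data was written, False if it was skipped.
        """
        key = (version, partition)
        if key in self.written:
            return False  # retry of an output that already landed: skip it
        self.rows.extend(data)
        self.written.add(key)
        return True


sink = IdempotentSink()
sink.write(7, 0, ["a", "b"])  # first attempt: partition 0 lands
# ... the batch fails before partition 1 is written, so version 7 is retried ...
sink.write(7, 0, ["a", "b"])  # skipped, already present
sink.write(7, 1, ["c"])       # written now
print(sink.rows)              # prints ['a', 'b', 'c']
```

Note that nothing in the sink says anything about source offsets; it only 
answers "has this output already been written?".
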
As this ticket states, updating the Kafka consumer group offsets in Kafka is 
only for easier progress monitoring using Kafka-specific tools. So, it should 
be considered an informational, after-the-fact update made as a courtesy, as it 
won't be used for Spark recovery. If a user wants to recover manually, they 
should rely on the Spark checkpoint offsets.
In other words, updating Kafka offsets after a batch has successfully committed 
means that the offsets in Kafka represent which messages have been successfully 
processed and landed in the sink, not which messages have been read. 
[~marmbrus] Is my understanding correct?
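
A minimal sketch of the ordering I have in mind, assuming three hypothetical 
callbacks (`process`, `commit_checkpoint`, `report_to_kafka`) that stand in for 
the real components; none of these names are Spark or Kafka APIs. The report 
to Kafka happens only after the checkpoint commit, and a failed report does 
not fail the batch.

```python
from typing import Callable, Dict

Offsets = Dict[int, int]  # Kafka partition -> next offset to read


def run_batch(process: Callable[[], None],
              commit_checkpoint: Callable[[Offsets], None],
              report_to_kafka: Callable[[Offsets], None],
              end_offsets: Offsets) -> bool:
    """Process a batch, commit it, then report offsets on a best-effort basis.

    Returns True if the informational report succeeded, False otherwise.
    """
    process()                       # raises on failure: nothing is committed or reported
    commit_checkpoint(end_offsets)  # the checkpoint is the source of truth for recovery
    try:
        report_to_kafka(end_offsets)  # informational only, for monitoring tools
        return True
    except Exception:
        return False  # a failed report must not fail the batch


checkpoint: Offsets = {}
reported: Offsets = {}
run_batch(lambda: None, checkpoint.update, reported.update, {0: 400, 1: 380})
print(checkpoint == reported)  # prints True
```

With this ordering, the offsets visible in Kafka can lag behind the checkpoint 
but never run ahead of it.
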

> Report committed offsets
> ------------------------
>
>                 Key: SPARK-17815
>                 URL: https://issues.apache.org/jira/browse/SPARK-17815
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>
> Since we manage our own offsets, we have turned off auto-commit.  However, 
> this means that external tools are not able to report on how far behind a 
> given streaming job is.  When the user manually gives us a group.id, we 
> should report back to it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
