[ 
https://issues.apache.org/jira/browse/SPARK-17815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15560661#comment-15560661
 ] 

Ofir Manor commented on SPARK-17815:
------------------------------------

I hope others with a better understanding of Spark internals can comment, but 
still:
1. "The Kafka commit log can't be ignored as merely for metric collection 
either" - that is exactly how I read this specific ticket (title and 
description)...
The way I understand it, when starting a Structured Streaming job for the first 
time, as of current trunk, a new consumer group is generated, with offsets 
being set based on a Spark source option, not based on Kafka defaults. After a 
failure, the offsets of the consumer group in Kafka are ignored, on purpose, 
and are overridden (via seek?) by the Structured Streaming infra based on its 
internal checkpoint. So, I don't follow your comment about it. If during 
recovery the offsets are not set correctly, or some corner case / exception is 
not handled, that is probably a bug in the Structured Streaming Kafka source 
that should be reported and fixed.
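To make the offset-resolution rule I'm describing concrete, here is a toy
simulation, not Spark's actual code - the function name `resolve_start_offsets`
and the checkpoint/offset layout are invented for illustration. On first start
the Spark `startingOffsets` source option wins; on recovery the internal
checkpoint wins, and whatever the Kafka consumer group has committed is ignored:

```python
# Toy simulation of how I understand the Structured Streaming Kafka source
# picks start offsets. All names here are illustrative, not Spark's API.

def resolve_start_offsets(checkpoint, starting_offsets_option,
                          kafka_group_offsets, earliest, latest):
    """Return the offsets the source would seek to for the next batch."""
    if checkpoint is not None:
        # Recovery: the internal checkpoint wins; the Kafka group's
        # committed offsets are deliberately ignored.
        return checkpoint
    # First start: the Spark source option decides, not Kafka group defaults.
    if starting_offsets_option == "earliest":
        return earliest
    if starting_offsets_option == "latest":
        return latest
    return starting_offsets_option  # explicit per-partition offsets

# First start with startingOffsets=latest: group offsets play no role.
first = resolve_start_offsets(None, "latest",
                              kafka_group_offsets={"t-0": 10},
                              earliest={"t-0": 0}, latest={"t-0": 42})
# After a failure, the checkpoint overrides everything else.
recovered = resolve_start_offsets({"t-0": 37}, "latest",
                                  kafka_group_offsets={"t-0": 10},
                                  earliest={"t-0": 0}, latest={"t-0": 42})
```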
2. Regarding your WAL comment - I'm not sure you are accurate. The WAL should 
be written to HDFS at the beginning of a batch, and a checkpoint at the end of 
the batch. So I'm not sure which corruption scenario you are implying. You are 
not referring to HDFS bugs, right? Is it just a potential vector for 
Spark-induced file corruption? I assume that, generally, if the checkpoint or 
the WAL is corrupted, the job would fail, as it can't guarantee exactly-once.
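The ordering I mean can be sketched as a toy batch loop (again my own
illustration, not Spark internals - `wal`, `checkpoint`, and `process` are
stand-ins): the planned offsets are logged before processing, and the
checkpoint is written only after the batch succeeds.

```python
# Toy batch loop illustrating the WAL-before / checkpoint-after ordering.
# `wal`, `checkpoint`, and `process` are stand-ins, not Spark internals.

def run_batch(batch_id, offsets, wal, checkpoint, process):
    wal.append((batch_id, offsets))    # 1. planned offset range goes to the WAL first
    result = process(offsets)          # 2. the batch is executed
    checkpoint[batch_id] = offsets     # 3. checkpoint written only after success
    return result

wal, checkpoint = [], {}
# Process one batch covering offsets [0, 42) of partition t-0.
out = run_batch(0, {"t-0": (0, 42)}, wal, checkpoint,
                process=lambda o: sum(hi - lo for lo, hi in o.values()))
```

If a failure happens between steps 1 and 3, recovery sees the batch in the WAL
but not in the checkpoint, so it can be replayed deterministically.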

> Report committed offsets
> ------------------------
>
>                 Key: SPARK-17815
>                 URL: https://issues.apache.org/jira/browse/SPARK-17815
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>
> Since we manage our own offsets, we have turned off auto-commit.  However, 
> this means that external tools are not able to report on how far behind a 
> given streaming job is.  When the user manually gives us a group.id, we 
> should report back to it.
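The monitoring gap described in the quoted ticket can be sketched as follows -
a toy model, since a real tool would query the broker for log-end and committed
offsets; the dict layout here is invented. With auto-commit off and no commits
from Spark, an external lag monitor sees nothing:

```python
# Toy model of consumer-lag reporting from committed group offsets.
# A monitoring tool only sees what the group has committed to the broker.

def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition; None where the group never committed anything."""
    return {p: (end - committed_offsets[p]) if p in committed_offsets else None
            for p, end in log_end_offsets.items()}

log_end = {"t-0": 100, "t-1": 80}
# Auto-commit off and Spark never commits: no progress is visible at all.
before = consumer_lag(log_end, {})
# If Spark reported its processed offsets back to the user's group.id:
after = consumer_lag(log_end, {"t-0": 95, "t-1": 80})
```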



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
