[ 
https://issues.apache.org/jira/browse/SPARK-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232425#comment-14232425
 ] 

Hari Shreedharan commented on SPARK-4707:
-----------------------------------------

No, not really. There is only one case where we'd lose data: when the store 
fails and the receiver is still active. There are two ways of working around 
this:
* Kill the consumer without committing the offsets and start a new consumer, 
which will start reading from the last commit. This is the easiest option, 
but creating new consumers is somewhat expensive and also causes duplicates 
due to rebalancing.
* Store all of the pending messages in an ordered buffer locally in the 
receiver and try to push the data again on failure (on success, just clear 
the buffer and commit). Once the data is pushed, commit the offset and start 
reading from Kafka again, committing offsets only when there are no pending 
messages. To make this smarter, we can keep track of how many messages are 
in each block for each topic and partition and commit
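
As a rough illustration of the second option, here is a minimal Python sketch 
of a buffer-then-commit loop. It is not the actual receiver code or the 
pending patch: BufferingReceiver, store, commit and max_retries are all 
hypothetical names, and the two callbacks only stand in for the block 
generator's store call and the Kafka offset commit.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

# (topic, partition, offset, payload). Hypothetical message shape.
Message = Tuple[str, int, int, bytes]
Offsets = Dict[Tuple[str, int], int]


class BufferingReceiver:
    """Toy model: buffer messages, retry the store, commit only on success."""

    def __init__(self, store: Callable[[List[Message]], None],
                 commit: Callable[[Offsets], None],
                 max_retries: int = 3) -> None:
        self._store = store      # stand-in for the block generator's store
        self._commit = commit    # stand-in for the Kafka offset commit
        self._max_retries = max_retries
        self._pending: Deque[Message] = deque()  # ordered, not yet stored

    def on_message(self, msg: Message) -> None:
        self._pending.append(msg)

    def pending_count(self) -> int:
        return len(self._pending)

    def flush(self) -> bool:
        """Try to store everything pending; commit only if the store worked."""
        if not self._pending:
            return True
        batch = list(self._pending)
        for _ in range(self._max_retries):
            try:
                self._store(batch)
            except Exception:
                continue  # buffer is untouched, so the same batch is retried
            # Highest stored offset per (topic, partition). A real consumer
            # may need to commit offset + 1; this sketch ignores that detail.
            offsets: Offsets = {}
            for topic, partition, offset, _payload in batch:
                key = (topic, partition)
                offsets[key] = max(offset, offsets.get(key, -1))
            self._commit(offsets)
            self._pending.clear()
            return True
        return False  # still pending: do not commit or read further
```

When flush() returns False the buffer is left intact, so the caller can stop 
reading from Kafka and retry later without having committed past unstored 
messages.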



> Reliable Kafka Receiver can lose data if the block generator fails to store 
> data
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-4707
>                 URL: https://issues.apache.org/jira/browse/SPARK-4707
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.2.0
>            Reporter: Hari Shreedharan
>
> The Reliable Kafka Receiver commits offsets only when events are actually 
> stored, which ensures that on restart we will start where we left off. But 
> if the failure happens in the store() call and the block generator reports 
> an error, the receiver does nothing and continues reading from the current 
> offset rather than from the last commit. This means that messages between 
> the last commit and the current offset will be lost. 
> I will send a PR for this soon - I have a patch which needs some minor fixes, 
> which I need to test.



