I think Spark Streaming currently lacks a mechanism for acknowledging data once it has been stored and replicated in the BlockManager, so data can be lost even after it has been pulled from Kafka. For example, suppose the data is still sitting in the BlockGenerator and has not yet reached the BlockManager, Kafka itself commits the consumer offset in the meantime, and the node fails at that point. From Kafka's point of view this data has been fed into Spark Streaming, but it has not actually been processed, so it will never be processed again unless you re-read the whole partition.
To solve this potential data loss problem, Spark Streaming needs to offer a data acknowledgement mechanism, so that a custom Receiver can use the acknowledgement to checkpoint or recover, as Storm does. Besides, driver failure is another story that needs to be considered carefully. So it is currently hard to guarantee no data loss in Spark Streaming; it still needs improvement in some areas ☺.

Thanks
Jerry

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Tuesday, August 19, 2014 10:47 AM
To: Wei Liu
Cc: user
Subject: Re: Data loss - Spark streaming and network receiver

Hi Wei,

On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu <wei....@stellarloyalty.com> wrote:
> Since our application cannot tolerate losing customer data, I am wondering what is the best way for us to address this issue.
>
> 1) We are thinking of writing application-specific logic to address the data loss. To us, the problem seems to be caused by the Kinesis receivers advancing their checkpoint before we know for sure that the data is replicated. For example, we can keep another checkpoint ourselves to remember the Kinesis sequence number of the data that has been processed by Spark Streaming. When a Kinesis receiver is restarted due to worker failures, we restart it from the checkpoint we tracked.

This sounds to me pretty much like the way Kafka does it. I am not saying that the stock KafkaReceiver does what you want (it may or may not), but it should be possible to update the "offset" (which corresponds to the "sequence number") in Zookeeper only after the data has been replicated successfully. I guess "replace Kinesis by Kafka" is not an option for you, but you may consider pulling the Kinesis data into Kafka before processing it with Spark?

Tobias
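
To make the "advance the offset only after the data is replicated" idea concrete, here is a minimal, self-contained simulation in plain Python. It does not use any Spark, Kafka, or Kinesis API: `Source`, `ReplicatedStore`, and `run_receiver` are hypothetical stand-ins invented for illustration, and the "node failure" is just an exception raised on a chosen call. Treat it as a sketch of the ordering, not as how any stock receiver is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical stand-in for a Kafka partition or Kinesis shard."""
    records: list
    committed: int = 0  # offset a restarted receiver resumes from

    def fetch(self, offset, max_records):
        return self.records[offset:offset + max_records]

    def commit(self, offset):
        self.committed = offset


@dataclass
class ReplicatedStore:
    """Hypothetical stand-in for replicated block storage."""
    blocks: list = field(default_factory=list)
    fail_on_call: int = -1  # simulate a node failure on this store() call
    calls: int = 0

    def store(self, batch):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise RuntimeError("node failed before replication finished")
        self.blocks.extend(batch)


def run_receiver(source, store, batch_size=2):
    """Resume from the committed offset; commit only after store() returns."""
    offset = source.committed
    while True:
        batch = source.fetch(offset, batch_size)
        if not batch:
            return
        store.store(batch)     # may raise; the offset is then NOT advanced
        offset += len(batch)
        source.commit(offset)  # acknowledge only data that was stored


source = Source(records=["a", "b", "c", "d", "e"])
store = ReplicatedStore(fail_on_call=2)

try:
    run_receiver(source, store)
except RuntimeError:
    pass  # receiver died; the committed offset still points at the failed batch

store.fail_on_call = -1      # the replacement node is healthy
run_receiver(source, store)  # restarted receiver resumes from source.committed
```

Because the commit happens after the store call, a crash between the two steps would replay a batch on restart, so this ordering gives at-least-once delivery rather than exactly-once. It also says nothing about driver failure, which remains a separate problem.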