I think Spark Streaming currently lacks a data acknowledging mechanism for
when data is stored and replicated in the BlockManager, so data can
potentially be lost even after it has been pulled from Kafka. Say the data is
stored only in the BlockGenerator, not yet in the BlockManager, while in the
meantime Kafka itself has committed the consumer offset; if the node fails at
this point, then from Kafka's point of view this part of the data has been fed
into Spark Streaming, but it has not actually been processed. So this part of
the data may never be processed, unless you read the whole partition again.

To solve this potential data loss problem, Spark Streaming needs to offer a
data acknowledging mechanism, so that a custom Receiver can use this
acknowledgement to checkpoint or recover, as Storm does.
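
For illustration, here is a minimal sketch of what such an acknowledging
receiver might look like on top of the current Receiver API. fetchBatch() and
commitOffset() are hypothetical placeholders, and the assumption that store()
only returns once the block is safely replicated is exactly the guarantee that
is missing today:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical acknowledging receiver: the consumer offset is committed
// back to the source only AFTER store() returns, assuming store() blocks
// until the block is replicated in the BlockManager (not guaranteed today).
class AckingReceiver(topic: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("acking-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* close source connections here */ }

  private def receive(): Unit = {
    while (!isStopped()) {
      // Pull one batch plus the offset of its last message.
      val (messages, lastOffset) = fetchBatch()
      store(messages)          // would need to block until replicated
      commitOffset(lastOffset) // acknowledge to the source only now
    }
  }

  // Placeholders: a real receiver would talk to Kafka/ZooKeeper here.
  private def fetchBatch(): (ArrayBuffer[String], Long) = ???
  private def commitOffset(offset: Long): Unit = ???
}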

Besides, driver failure is another story that needs to be carefully
considered. So currently it is hard to guarantee no data loss in Spark
Streaming; it still needs improvement at some points ☺.

Thanks
Jerry

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Tuesday, August 19, 2014 10:47 AM
To: Wei Liu
Cc: user
Subject: Re: Data loss - Spark streaming and network receiver

Hi Wei,

On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu <wei....@stellarloyalty.com> wrote:
Since our application cannot tolerate losing customer data, I am wondering what 
is the best way for us to address this issue.
1) We are thinking of writing application-specific logic to address the data
loss. To us, the problem seems to be that the Kinesis receivers advance their
checkpoint before we know for sure the data is replicated. For example, we can
do another checkpoint ourselves to remember the Kinesis sequence number for
data that has been processed by Spark Streaming. When a Kinesis receiver is
restarted due to worker failures, we restart it from the checkpoint we
tracked.
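
A minimal sketch of that idea, assuming the Kinesis Client Library's
IRecordProcessor interface; confirmReplicated() is a hypothetical hook into
such an application-level checkpoint, and this is not the stock Kinesis
receiver:

import java.util.{List => JList}
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.clientlibrary.interfaces.{IRecordProcessor, IRecordProcessorCheckpointer}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
import com.amazonaws.services.kinesis.model.Record

// Only advance the KCL checkpoint up to the last record that our own
// application-level checkpoint has confirmed as replicated.
class DeferredCheckpointProcessor(confirmReplicated: Record => Boolean)
  extends IRecordProcessor {

  override def initialize(shardId: String): Unit = {}

  override def processRecords(records: JList[Record],
                              checkpointer: IRecordProcessorCheckpointer): Unit = {
    var lastSafe: String = null
    // Stop at the first unconfirmed record so we never checkpoint past it.
    for (r <- records.asScala.takeWhile(confirmReplicated)) {
      lastSafe = r.getSequenceNumber
    }
    if (lastSafe != null) checkpointer.checkpoint(lastSafe)
  }

  override def shutdown(checkpointer: IRecordProcessorCheckpointer,
                        reason: ShutdownReason): Unit = {
    if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
  }
}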

This sounds pretty much to me like the way Kafka does it. So, I am not saying
that the stock KafkaReceiver does what you want (it may or may not), but it
should be possible to update the "offset" (which corresponds to your "sequence
number") in Zookeeper only after the data has been replicated successfully. I
guess "replace Kinesis with Kafka" is not an option for you, but you might
consider pulling Kinesis data into Kafka before processing it with Spark?
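
For example, with the Kafka 0.8 high-level consumer one could disable
auto-commit and write the offset to ZooKeeper only after the data is known to
be safely stored. A rough sketch, where storeAndReplicate() is a hypothetical
stand-in for handing the data to Spark Streaming and waiting for replication:

import java.util.Properties
import kafka.consumer.{Consumer, ConsumerConfig}

// Commit offsets to ZooKeeper manually, only after the data is safe.
object ManualOffsetCommit {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("zookeeper.connect", "localhost:2181")
    props.put("group.id", "spark-streaming-consumer")
    props.put("auto.commit.enable", "false") // we commit offsets ourselves

    val connector = Consumer.create(new ConsumerConfig(props))
    val streams = connector.createMessageStreams(Map("events" -> 1))

    for (stream <- streams("events"); msgAndMeta <- stream) {
      storeAndReplicate(msgAndMeta.message) // must succeed first...
      connector.commitOffsets               // ...then advance the offset
    }
  }

  // Hypothetical: hand the message to Spark and wait for replication.
  def storeAndReplicate(payload: Array[Byte]): Unit = ???
}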

Tobias
