Dear All,

Recently I wrote a Spark Kafka consumer to solve this problem. We too have
seen issues with KafkaUtils, which uses the high-level Kafka consumer and
gives the consumer code no handle on offset management.

The code below solves this problem; it has been tested in our Spark
cluster and is working fine so far.

https://github.com/dibbhatt/kafka-spark-consumer

This is a low-level Kafka consumer that uses the Kafka Simple Consumer API.
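
For anyone not familiar with the Simple Consumer API, a minimal fetch loop
looks roughly like the sketch below. This is illustrative only and is not
code from the project; the broker host, topic name, and fetch sizes are
placeholder assumptions.

    import kafka.api.FetchRequestBuilder
    import kafka.consumer.SimpleConsumer

    // Placeholder broker/topic settings for illustration only.
    val consumer = new SimpleConsumer("broker-host", 9092, 10000, 64 * 1024, "low-level-client")
    val topic = "events"
    val partition = 0
    var offset = 0L  // in practice, read the last committed offset from ZooKeeper

    val request = new FetchRequestBuilder()
      .clientId("low-level-client")
      .addFetch(topic, partition, offset, 64 * 1024)
      .build()

    val response = consumer.fetch(request)
    for (messageAndOffset <- response.messageSet(topic, partition)) {
      val payload = messageAndOffset.message.payload  // ByteBuffer holding the record bytes
      offset = messageAndOffset.nextOffset            // advance only after handling the message
    }
    consumer.close()

The point of the low-level API is exactly this explicit offset handling:
nothing moves the offset forward until the consumer decides to.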

Please have a look at it and let me know your opinion. It was written to
eliminate data loss by committing the offset only after the data has been
written to the BlockManager (BM). The existing high-level KafkaUtils also
has no way to control the data flow and runs out of memory when there is
too much backlog in Kafka; this consumer solves that problem as well. The
code was adapted from an earlier Storm Kafka consumer and has many other
features, such as recovery from Kafka node failures, ZK failures, offset
errors, etc.
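
To make the commit-after-store idea concrete, here is a rough sketch of a
custom Receiver shaped the way described above. It is not the project's
actual code; fetchBatch and commitOffsetToZk are hypothetical placeholders
for the real fetch and ZooKeeper commit logic.

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class OffsetCommittingReceiver
      extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK_SER) {

      def onStart(): Unit = {
        new Thread("kafka-low-level-fetcher") {
          override def run(): Unit = {
            while (!isStopped()) {
              val (messages, lastOffset) = fetchBatch()  // bounded batch size => flow control
              store(messages.iterator)                   // hand the block to the BlockManager
              commitOffsetToZk(lastOffset)               // commit only after store() returns
            }
          }
        }.start()
      }

      def onStop(): Unit = {}

      // Hypothetical placeholders, not real implementations.
      private def fetchBatch(): (Seq[Array[Byte]], Long) = ???
      private def commitOffsetToZk(offset: Long): Unit = ???
    }

Because each fetch is a bounded batch and the offset is committed only after
the data reaches the BlockManager, a crash before the commit just means the
same batch is re-fetched rather than lost.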

Regards,
Dibyendu


On Tue, Aug 19, 2014 at 9:49 AM, Shao, Saisai <saisai.s...@intel.com> wrote:

> I think Spark Streaming currently lacks a mechanism for acknowledging
> data once it has been stored and replicated in the BlockManager, so data
> can be lost even after it has been pulled from Kafka. Say the data is
> stored only in the BlockGenerator and not yet in the BM, while in the
> meantime Kafka itself commits the consumer offset, and the node fails at
> that point: from Kafka's point of view this data has been fed into Spark
> Streaming, but it has not actually been processed yet, so it may never be
> processed again unless you re-read the whole partition.
>
>
>
> To solve this potential data loss problem, Spark Streaming needs to offer
> a data acknowledging mechanism, so that a custom Receiver can use this
> acknowledgement to checkpoint or recover, as in Storm.
>
>
>
> Besides, driver failure is another story that needs to be carefully
> considered. So currently it is hard to guarantee no data loss in Spark
> Streaming; it still needs improvement at some points :)
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Tobias Pfeiffer [mailto:t...@preferred.jp]
> *Sent:* Tuesday, August 19, 2014 10:47 AM
> *To:* Wei Liu
> *Cc:* user
> *Subject:* Re: Data loss - Spark streaming and network receiver
>
>
>
> Hi Wei,
>
>
>
> On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu <wei....@stellarloyalty.com>
> wrote:
>
> Since our application cannot tolerate losing customer data, I am wondering
> what is the best way for us to address this issue.
>
> 1) We are thinking of writing application-specific logic to address the
> data loss. To us, the problem seems to be that Kinesis receivers advance
> their checkpoint before we know for sure the data has been replicated.
> For example, we could keep an additional checkpoint of our own that
> records the Kinesis sequence number of the data already processed by
> Spark Streaming. When a Kinesis receiver is restarted due to a worker
> failure, we would restart it from the checkpoint we tracked.
>
>
>
> This sounds pretty much like the way Kafka does it. I am not saying that
> the stock KafkaReceiver does what you want (it may or may not), but it
> should be possible to update the "offset" (which corresponds to "sequence
> number") in Zookeeper only after the data has been replicated
> successfully. I guess "replace Kinesis by Kafka" is not an option for
> you, but you may consider pulling the Kinesis data into Kafka before
> processing it with Spark?
>
>
>
> Tobias
>
>
>
