Re: Data Loss - Spark streaming
Hi Jeniba,

The second part of this meetup recording has a very good answer to your question. TD explains the current behavior and the ongoing work in Spark Streaming to fix HA: https://www.youtube.com/watch?v=jcJq3ZalXD8

-kr, Gerard.

On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson jeniba.john...@lntinfotech.com wrote:

Hi, I need a clarification. While running the streaming examples, suppose the batch interval is set to 5 minutes. After the data collected from the input source (Flume) has been processed up to the 5-minute mark, what happens to the data that keeps flowing in from the input source to Spark Streaming? Will that data be stored somewhere, or will it be lost? And if it can be lost, what is the solution for capturing every record without any loss in Spark Streaming? Awaiting your kind reply.

Regards, Jeniba Johnson
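For reference, a minimal sketch of the scenario in the question, assuming the receiver-based spark-streaming-flume integration (host, port, and app name below are placeholders). In normal operation the data that keeps flowing in is not dropped: the receiver buffers it into blocks continuously, and at every batch interval those blocks become the next batch while ingestion continues. The failure case TD discusses in the talk is what happens to those buffered blocks when a receiver node dies.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeBatchExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumeBatchExample")
    // 5-minute batches: the receiver keeps ingesting and buffering data
    // continuously; every 5 minutes the buffered blocks are cut into one
    // batch (an RDD) and processed while ingestion carries on.
    val ssc = new StreamingContext(conf, Minutes(5))

    // Receiver-based Flume stream; host and port are placeholders.
    val events = FlumeUtils.createStream(ssc, "localhost", 4141)
    events.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}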
Re: Data Loss - Spark streaming
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s
Re: Data loss - Spark streaming and network receiver
Hi Wei,

On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu wei@stellarloyalty.com wrote:

Since our application cannot tolerate losing customer data, I am wondering what is the best way for us to address this issue. 1) We are thinking of writing application-specific logic to address the data loss. To us, the problem seems to be caused by the Kinesis receivers advancing their checkpoint before we know for sure that the data has been replicated. For example, we could do another checkpoint ourselves to remember the Kinesis sequence number for data that has been processed by Spark Streaming. When a Kinesis receiver is restarted due to a worker failure, we would restart it from the checkpoint we tracked.

This sounds pretty much to me like the way Kafka does it. So, I am not saying that the stock KafkaReceiver does what you want (it may or may not), but it should be possible to update the offset (which corresponds to the sequence number) in ZooKeeper only after the data has been replicated successfully. I guess replacing Kinesis with Kafka is not an option for you, but you could consider pulling the Kinesis data into Kafka before processing it with Spark?

Tobias
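A hypothetical sketch of that commit-after-replication pattern as a custom receiver. This is not the stock KafkaReceiver; fetchBatch() and commitCheckpoint() are placeholders, not a real Kafka or Kinesis client API. It relies on the documented behavior that the multi-record store() variants block until Spark has stored the block.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch only: advance the source-side checkpoint (Kafka offset /
// Kinesis sequence number) strictly after Spark has stored the block.
class AckAfterStoreReceiver
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    new Thread("ack-after-store-loop") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}

  private def receive(): Unit = {
    while (!isStopped()) {
      val (records, lastPosition) = fetchBatch()
      // The multi-record store() blocks until the block is stored (and
      // replicated per the StorageLevel above), so committing afterwards
      // never acknowledges data that Spark could still lose on this node.
      store(records.iterator)
      commitCheckpoint(lastPosition)
    }
  }

  // Placeholders standing in for a real consumer and offset store (e.g. ZooKeeper).
  private def fetchBatch(): (Seq[String], Long) = ???
  private def commitCheckpoint(position: Long): Unit = ???
}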
RE: Data loss - Spark streaming and network receiver
I think Spark Streaming currently lacks a data-acknowledging mechanism for when data has been stored and replicated in the BlockManager, so data can potentially be lost even when it is pulled through Kafka. Say the data is stored only in the BlockGenerator, not yet in the BlockManager, while Kafka has meanwhile committed the consumer offset, and at that point the node fails: from Kafka's point of view this data has been fed into Spark Streaming, but it has not actually been processed, so it will never be processed again unless you re-read the whole partition.

To solve this potential data-loss problem, Spark Streaming needs to offer a data-acknowledging mechanism, so that a custom Receiver can use the acknowledgement to do checkpointing and recovery, as Storm does. Besides that, driver failure is another story that needs to be carefully considered. So currently it is hard to guarantee no data loss in Spark Streaming; it still needs improvement at some points ☺.

Thanks, Jerry
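For contrast with the sketch above, a hypothetical receiver showing exactly the unsafe window Jerry describes; nextRecord() and commitOffset() are placeholders, not a real client API.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Anti-pattern sketch: a single-record store() only hands the record to
// the BlockGenerator, which batches records in executor memory before
// pushing a block to the BlockManager. Committing the source offset
// right after store(record) therefore acknowledges data that dies with
// the node if it fails before that push and replication happen.
class LossWindowReceiver
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    new Thread("loss-window-loop") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}

  private def receive(): Unit = {
    while (!isStopped()) {
      val (record, offset) = nextRecord()
      store(record)        // buffered in the BlockGenerator only, not yet in the BM
      commitOffset(offset) // UNSAFE: acknowledged before storage/replication
    }
  }

  // Placeholders standing in for a real source and offset store.
  private def nextRecord(): (String, Long) = ???
  private def commitOffset(offset: Long): Unit = ???
}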
Re: Data loss - Spark streaming and network receiver
Dear All,

Recently I have written a Spark Kafka consumer to solve this problem. We have also seen issues with KafkaUtils, which uses the high-level Kafka consumer, where the consumer code has no handle on offset management. The code below solves this problem; it is being tested in our Spark cluster and is working fine as of now. https://github.com/dibbhatt/kafka-spark-consumer

This is a low-level Kafka consumer using the Kafka SimpleConsumer API. Please have a look at it and let me know your opinion. It has been written to eliminate data loss by committing the offset only after the data is written to the BlockManager. Also, the existing high-level KafkaUtils does not have any feature to control the data flow, and it gives Out Of Memory errors if there is too much backlog in Kafka. This consumer solves that problem as well. The code has been adapted from the earlier Storm Kafka consumer, and it has a lot of other features, like recovery from Kafka node failures, ZK failures, offset errors, etc.

Regards, Dibyendu
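On the flow-control point, later Spark versions (around 1.1) also expose a per-receiver rate cap, spark.streaming.receiver.maxRate, as one mitigation for deep backlogs, independent of the flow control inside kafka-spark-consumer itself. A minimal sketch, with a placeholder socket source standing in for a Kafka receiver:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ThrottledIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ThrottledIngest")
      // Caps each receiver's ingest rate (records/sec) so a deep Kafka
      // backlog drains at a bounded rate instead of exhausting memory.
      .set("spark.streaming.receiver.maxRate", "10000")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder receiver-based source; the rate cap applies to any
    // receiver, including a custom Kafka one.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}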
Re: Data loss - Spark streaming and network receiver
Thank you all for responding to my question. I am pleasantly surprised by how many prompt responses I got; it shows the strength of the Spark community.

Kafka is still an option for us, so I will check out the link provided by Dibyendu. Meanwhile, if someone out there has already figured this out with Kinesis, please keep your suggestions coming. Thanks.

Thanks, Wei