Re: KafkaUtils explicit acks

2014-12-16 Thread Mukesh Jha
I agree that this is not a trivial task, as in this approach the Kafka acks
would be done by the Spark tasks, which means a pluggable way to ack your
input data source, i.e. changes in core.

From my limited experience with Kafka + Spark, what I've seen is that if Spark
tasks take longer than the batch interval, the next batch waits for the
previous one to finish, so I was wondering if offset management could be done
by Spark too.

I'm just trying to figure out whether this seems to be a worthwhile addition
to have.
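For what it's worth, a rough sketch of where such a hook could live on the
driver today. Spark Streaming already reports when a batch finishes; the
missing piece is mapping a batch back to the Kafka offsets it consumed.
commitOffsetsFor below is a hypothetical helper, not an existing API:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Sketch only: fires on the driver once a batch's jobs have completed.
// commitOffsetsFor is a hypothetical helper that would map a batch time to the
// Kafka offsets consumed for that batch -- exactly the metadata that is missing today.
class AckOnBatchComplete(commitOffsetsFor: Long => Unit) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    commitOffsetsFor(batch.batchInfo.batchTime.milliseconds)
  }
}

// Registration would be: ssc.addStreamingListener(new AckOnBatchComplete(commitFn))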


-- 


Thanks & Regards,

*Mukesh Jha me.mukesh@gmail.com*


Re: KafkaUtils explicit acks

2014-12-16 Thread Cody Koeninger
Do you actually need spark streaming per se for your use case?  If you're
just trying to read data out of kafka into hbase, would something like this
non-streaming rdd work for you:

https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka

Note that if you're trying to get exactly-once semantics out of kafka, you
need either idempotent writes, or a transactional relationship between the
place you're storing data and the place you're storing offsets.  Using
normal batch rdds instead of streaming makes the second approach pretty
trivial actually.
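To make the second approach concrete, here is a minimal sketch of that
transactional pattern. The JDBC URL, table names and schema are made up for
illustration; the point is only that a partition's records and its ending
offset commit together or not at all:

import java.sql.DriverManager

// Sketch: write a partition's records and its ending offset in a single transaction.
// After a failure you re-read the stored offset and resume from there, so a replayed
// partition is harmless. Connection details and SQL are placeholders.
def saveWithOffset(topic: String, partition: Int, untilOffset: Long,
                   records: Iterator[(String, String)]): Unit = {
  val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/example")
  try {
    conn.setAutoCommit(false)
    val insert = conn.prepareStatement("INSERT INTO events (k, v) VALUES (?, ?)")
    records.foreach { case (k, v) =>
      insert.setString(1, k)
      insert.setString(2, v)
      insert.executeUpdate()
    }
    val bump = conn.prepareStatement(
      "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND partition = ?")
    bump.setLong(1, untilOffset)
    bump.setString(2, topic)
    bump.setInt(3, partition)
    bump.executeUpdate()
    conn.commit()   // data and offset become visible atomically
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}

With plain batch RDDs you would call something like this from foreachPartition,
reading the starting offsets back out of the same table before each run.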

Re: KafkaUtils explicit acks

2014-12-14 Thread Mukesh Jha
Thanks TD & Francois for the explanation & documentation. I'm curious whether we
have any performance benchmark with & without the WAL for spark-streaming-kafka.

Also, in spark-streaming-kafka (as Kafka provides a way to acknowledge offsets)
on top of the WAL, can we modify KafkaUtils to acknowledge the offsets only when
the RDDs are fully processed and are getting evicted out of Spark memory? That
way we can be completely sure that all the records are getting processed in the
system.
I was thinking it might be good to have the Kafka offset information of each
batch as part of the RDDs' metadata and commit the offsets once the RDDs'
lineage is complete.
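Purely to make that proposal concrete, a rough sketch of the shape it could
take. Nothing below exists in KafkaUtils today: kafkaStream stands for the
DStream returned by KafkaUtils.createStream, recordedOffsets for the per-batch
metadata being proposed, and commitToZookeeper for whatever would write the
group's offsets back:

import scala.collection.mutable
import org.apache.spark.streaming.Time

// Hypothetical driver-side bookkeeping: batch time -> (partition -> last offset consumed).
// Nothing fills this map today; the receiver would have to record it per block/batch.
val recordedOffsets = mutable.Map[Time, Map[Int, Long]]()

// Hypothetical helper that writes the consumer group's offsets back to ZooKeeper.
def commitToZookeeper(offsets: Map[Int, Long]): Unit = ???

kafkaStream.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { records =>
    records.foreach { case (key, value) => /* write to HBase */ }
  }
  // Ack only after the whole batch has been processed.
  recordedOffsets.remove(time).foreach(commitToZookeeper)
}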

-- 


Thanks & Regards,

*Mukesh Jha me.mukesh@gmail.com*


RE: KafkaUtils explicit acks

2014-12-14 Thread Shao, Saisai
Hi,

It is not trivial work to acknowledge the offsets when an RDD is fully
processed. From my understanding, only modifying KafkaUtils is not enough to
meet your requirement: you would need to add metadata management for each
block/RDD and track it on both the executor and driver side, and many other
things would also have to be taken care of :).

Thanks
Jerry



Re: KafkaUtils explicit acks

2014-12-11 Thread Tathagata Das
I am updating the docs right now. Here is a staged copy that you can
have a sneak peek of. This will be part of Spark 1.2.

http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html

The updated fault-tolerance section tries to simplify the explanation
of when and what data can be lost, and how to prevent that using the
new experimental feature of write ahead logs.
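For anyone who wants to try it before the guide lands, a minimal sketch of
enabling the write ahead log on a receiver-based Kafka stream; the app name,
checkpoint path, ZooKeeper quorum, group, topic map and batch interval are
placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("kafka-wal-example")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // the new WAL switch

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/kafka-wal-example") // the WAL lives under the checkpoint dir

// With the WAL persisting received data, the usual in-memory 2x replication is
// redundant, so a non-replicated storage level is enough for the receiver.
val stream = KafkaUtils.createStream(
  ssc, "zkhost:2181", "consumer-group", Map("topic" -> 2),
  StorageLevel.MEMORY_AND_DISK_SER)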
Any feedback will be much appreciated.

TD




Re: KafkaUtils explicit acks

2014-12-10 Thread Mukesh Jha
Hello Guys,

Any insights on this?
If I'm not clear enough, my question is: how can I use the Kafka consumer and
not lose any data in case of failures with spark-streaming?

On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha me.mukesh@gmail.com wrote:

 Hello Experts,

 I'm working on a Spark app which reads data from Kafka & persists it in
 HBase.

 Spark documentation states the below *[1]*: in case of worker failure we can
 lose some data. If so, how can I make my Kafka stream more reliable?
 I have seen there is a simple consumer *[2]* but I'm not sure if it has
 been used/tested extensively.

 I was wondering if there is a way to explicitly acknowledge the Kafka
 offsets once they are replicated in the memory of other worker nodes (if it's
 not already done) to tackle this issue.

 Any help is appreciated in advance.


1. *Using any input source that receives data through a network* - For
network-based data sources like *Kafka* and Flume, the received input
data is replicated in memory between nodes of the cluster (default
replication factor is 2). So if a worker node fails, then the system can
recompute the lost data from the left over copy of the input data. However,
if the *worker node where a network receiver was running fails, then a
tiny bit of data may be lost*, that is, the data received by the
system but not yet replicated to other node(s). The receiver will be
started on a different node and it will continue to receive data.
2. https://github.com/dibbhatt/kafka-spark-consumer
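On the "default replication factor is 2" point in [1]: with the receiver-based
API that replication comes from the storage level passed to
KafkaUtils.createStream, so it can be made explicit. A small sketch (ssc,
ZooKeeper quorum, group and topic map are placeholders):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// MEMORY_AND_DISK_SER_2 is the two-replica ("_2") level the passage above refers to;
// it is also what createStream uses when no storage level is given.
val stream = KafkaUtils.createStream(
  ssc, "zkhost:2181", "consumer-group", Map("topic" -> 2),
  StorageLevel.MEMORY_AND_DISK_SER_2)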

 Txz,

 *Mukesh Jha me.mukesh@gmail.com*




-- 


Thanks & Regards,

*Mukesh Jha me.mukesh@gmail.com*


Re: KafkaUtils explicit acks

2014-12-10 Thread francois . garillot
Hi Mukesh,

There’s been some great work on Spark Streaming reliability lately

I’m not aware of any doc yet (did I miss something ?) but you can look at the
ReliableKafkaReceiver’s test suite:

—
FG


Re: KafkaUtils explicit acks

2014-12-10 Thread francois . garillot
[sorry for the botched half-message]

Hi Mukesh,

There’s been some great work on Spark Streaming reliability lately.
https://www.youtube.com/watch?v=jcJq3ZalXD8

Look at the links from:
https://issues.apache.org/jira/browse/SPARK-3129

I’m not aware of any doc yet (did I miss something ?) but you can look at the
ReliableKafkaReceiver’s test suite:

external/kafka/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala

—
FG
