I agree that this is not a trivial task: in this approach the Kafka acks will be done by the Spark tasks, which means a pluggable means of acking your input data source, i.e. changes in core.
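To make that "pluggable" idea concrete, here is a minimal sketch of what such an acknowledgement hook could look like. All of these names are hypothetical and illustrative only; nothing like this exists in Spark's API today.

```java
// Hypothetical sketch only: a pluggable acknowledgement hook for an input
// source. These names are illustrative and are not part of Spark's API.
import java.util.HashMap;
import java.util.Map;

interface AckableSource {
    // Called once data up to `offset` in `partition` has been fully
    // processed, so the source can commit it upstream (e.g. to Kafka).
    void ack(int partition, long offset);
}

// Toy in-memory implementation tracking the highest acked offset per
// partition; acks for lower offsets are ignored.
class InMemoryAckSource implements AckableSource {
    private final Map<Integer, Long> acked = new HashMap<>();

    @Override
    public void ack(int partition, long offset) {
        acked.merge(partition, offset, Math::max);
    }

    public Long highestAcked(int partition) {
        return acked.get(partition); // null if nothing acked yet
    }
}
```

A real implementation would commit the offsets back to Kafka (or whatever the source is) rather than to a map; the interface is the point, not the body.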
From my limited experience with Kafka + Spark, what I've seen is that if Spark tasks take longer than the batch interval, the next batch waits for the previous one to finish, so I was wondering whether offset management could be done by Spark too. I'm just trying to figure out if this seems to be a worthwhile addition to have?

On Mon, Dec 15, 2014 at 11:39 AM, Shao, Saisai <saisai.s...@intel.com> wrote:

> Hi,
>
> It is not trivial work to acknowledge the offsets when an RDD is fully
> processed. From my understanding, modifying KafkaUtils alone is not enough
> to meet your requirement: you need to add metadata management for each
> block/RDD, track them on both the executor and driver side, and many other
> things also need to be taken care of :)
>
> Thanks
> Jerry
>
> From: mukh....@gmail.com [mailto:mukh....@gmail.com] On Behalf Of Mukesh Jha
> Sent: Monday, December 15, 2014 1:31 PM
> To: Tathagata Das
> Cc: francois.garil...@typesafe.com; user@spark.apache.org
> Subject: Re: KafkaUtils explicit acks
>
> Thanks TD & Francois for the explanation & documentation. I'm curious
> whether we have any performance benchmarks with & without the WAL for
> spark-streaming-kafka.
>
> Also, in spark-streaming-kafka (as Kafka provides a way to acknowledge
> logs), on top of the WAL can we modify KafkaUtils to acknowledge the
> offsets only when the RDDs are fully processed and are being evicted from
> Spark memory? That way we can be 100 percent sure that all the records are
> processed by the system.
>
> I was thinking it might be good to have the Kafka offset information of
> each batch as part of the RDD's metadata, and to commit the offsets once
> the RDD lineage is complete.
>
> On Thu, Dec 11, 2014 at 6:26 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>
> I am updating the docs right now. Here is a staged copy that you can have
> a sneak peek of. This will be part of Spark 1.2.
> http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
>
> The updated fault-tolerance section tries to simplify the explanation of
> when and what data can be lost, and how to prevent that using the new
> experimental write-ahead-log feature. Any feedback will be much
> appreciated.
>
> TD
>
> On Wed, Dec 10, 2014 at 2:42 AM, <francois.garil...@typesafe.com> wrote:
>
>> [sorry for the botched half-message]
>>
>> Hi Mukesh,
>>
>> There's been some great work on Spark Streaming reliability lately.
>> https://www.youtube.com/watch?v=jcJq3ZalXD8
>> Look at the links from:
>> https://issues.apache.org/jira/browse/SPARK-3129
>>
>> I'm not aware of any doc yet (did I miss something?), but you can look at
>> the ReliableKafkaReceiver's test suite:
>> external/kafka/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala
>>
>> --
>> FG
>>
>> On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha <me.mukesh....@gmail.com> wrote:
>>
>>> Hello Guys,
>>>
>>> Any insights on this? If I'm not clear enough, my question is: how can I
>>> use a Kafka consumer and not lose any data in case of failures with
>>> spark-streaming?
>>>
>>> On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha <me.mukesh....@gmail.com> wrote:
>>>
>>>> Hello Experts,
>>>>
>>>> I'm working on a Spark app which reads data from Kafka and persists it
>>>> in HBase.
>>>>
>>>> The Spark documentation states the below [1], i.e. that in case of
>>>> worker failure we can lose some data. If so, how can I make my Kafka
>>>> stream more reliable? I have seen there is a simple consumer [2], but
>>>> I'm not sure if it has been used/tested extensively.
>>>>
>>>> I was wondering if there is a way to explicitly acknowledge the Kafka
>>>> offsets once they are replicated in the memory of other worker nodes
>>>> (if it's not already done) to tackle this issue.
>>>>
>>>> Any help is appreciated in advance.
>>>> [1] "Using any input source that receives data through a network - For
>>>> network-based data sources like Kafka and Flume, the received input
>>>> data is replicated in memory between nodes of the cluster (default
>>>> replication factor is 2). So if a worker node fails, then the system
>>>> can recompute the lost data from the left-over copy of the input data.
>>>> However, if the worker node where a network receiver was running fails,
>>>> then a tiny bit of data may be lost, that is, the data received by the
>>>> system but not yet replicated to other node(s). The receiver will be
>>>> started on a different node and it will continue to receive data."
>>>>
>>>> [2] https://github.com/dibbhatt/kafka-spark-consumer
>>>>
>>>> Txz,
>>>> Mukesh Jha

--
Thanks & Regards,
Mukesh Jha
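The batch-metadata idea floated in this thread (carry each batch's Kafka offset ranges alongside the batch, and commit them only once the batch's lineage is complete) is, at its core, simple bookkeeping. A sketch of that bookkeeping, with purely hypothetical names and no Spark or Kafka dependency:

```java
// Hypothetical bookkeeping sketch: track Kafka offset ranges per batch and
// only expose them for commit once the batch is fully processed.
// All names are illustrative; this is not Spark or Kafka API.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OffsetTracker {
    // Half-open range [from, until) consumed from one partition by a batch.
    record OffsetRange(int partition, long from, long until) {}

    private final Map<Long, List<OffsetRange>> pending = new HashMap<>();
    private final Map<Integer, Long> committed = new HashMap<>();

    // Record the ranges a batch consumed, before processing starts.
    void recordBatch(long batchId, List<OffsetRange> ranges) {
        pending.put(batchId, new ArrayList<>(ranges));
    }

    // Call once the batch's RDDs are fully processed; returns per-partition
    // offsets that are now safe to acknowledge to the source.
    Map<Integer, Long> completeBatch(long batchId) {
        List<OffsetRange> ranges = pending.remove(batchId);
        if (ranges != null) {
            for (OffsetRange r : ranges) {
                committed.merge(r.partition(), r.until(), Math::max);
            }
        }
        return Map.copyOf(committed);
    }
}
```

As Jerry notes, the hard part is not this bookkeeping but wiring it into the block/RDD lifecycle on both the executor and driver side, which is why it would mean changes in core rather than just in KafkaUtils.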