I agree that this is not a trivial task: in this approach the Kafka acks
would be done by the Spark tasks, which means a pluggable mechanism to
acknowledge your input data source, i.e. changes in core.

From my limited experience with Kafka + Spark, what I've seen is that if
Spark tasks take longer than the batch interval, the next batch waits for
the previous one to finish, so I was wondering if offset management could
be done by Spark too.

I'm just trying to figure out whether this would be a worthwhile addition
to have.
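To make the idea concrete, here is a minimal Scala sketch, for illustration only, of what "Spark-managed" offset acknowledgement could look like: carry each batch's offset range as metadata and commit (ack) it only after the batch is fully processed, so a failed batch is re-read from the last committed offset. The names here (`OffsetRange`, `ManualOffsetTracker`) are hypothetical, not the actual KafkaUtils API:

```scala
// Hypothetical sketch of at-least-once offset management: offsets are
// committed only after the processing function succeeds, so a failure
// leaves the last committed offset untouched and the batch is replayed.
case class OffsetRange(topic: String, partition: Int, from: Long, until: Long)

class ManualOffsetTracker {
  // Last committed offset per (topic, partition).
  private val committed = scala.collection.mutable.Map[(String, Int), Long]()

  def processBatch(range: OffsetRange)(process: OffsetRange => Unit): Unit = {
    process(range) // may throw; in that case nothing below runs
    // Ack only once processing has completed successfully.
    committed((range.topic, range.partition)) = range.until
  }

  def committedOffset(topic: String, partition: Int): Long =
    committed.getOrElse((topic, partition), 0L)
}
```

In a real integration the commit would go to Kafka (or ZooKeeper, for the high-level consumer of that era) rather than an in-memory map, and the tracking would have to survive driver failure, which is exactly the metadata-management work Jerry describes below.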

On Mon, Dec 15, 2014 at 11:39 AM, Shao, Saisai <saisai.s...@intel.com>
wrote:
>
>  Hi,
>
>
>
> It is not trivial work to acknowledge the offsets when an RDD is fully
> processed. From my understanding, only modifying KafkaUtils is not enough
> to meet your requirement: you would need to add metadata management for
> each block/RDD and track them on both the executor and driver side, and
> many other things would also need to be taken care of :).
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* mukh....@gmail.com [mailto:mukh....@gmail.com] *On Behalf Of *Mukesh
> Jha
> *Sent:* Monday, December 15, 2014 1:31 PM
> *To:* Tathagata Das
> *Cc:* francois.garil...@typesafe.com; user@spark.apache.org
> *Subject:* Re: KafkaUtils explicit acks
>
>
>
> Thanks TD & Francois for the explanation & documentation. I'm curious if
> we have any performance benchmark with & without WAL for
> spark-streaming-kafka.
>
>
>
> Also, in spark-streaming-kafka (since Kafka provides a way to acknowledge
> logs), on top of the WAL can we modify KafkaUtils to acknowledge the
> offsets only when the RDDs are fully processed and are getting evicted out
> of Spark memory? That way we can be one hundred percent sure that all the
> records are getting processed in the system.
>
> I was wondering whether it would be good to have the Kafka offset
> information of each batch as part of the RDD's metadata and to commit the
> offsets once the RDD lineage is complete.
>
>
>
> On Thu, Dec 11, 2014 at 6:26 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
> I am updating the docs right now. Here is a staged copy that you can
> have a sneak peek of. This will be part of Spark 1.2.
>
>
> http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
>
> The updated fault-tolerance section tries to simplify the explanation
> of when and what data can be lost, and how to prevent that using the
> new experimental feature of write ahead logs.
> Any feedback will be much appreciated.
>
> TD
>
>
> On Wed, Dec 10, 2014 at 2:42 AM,  <francois.garil...@typesafe.com> wrote:
> > [sorry for the botched half-message]
> >
> > Hi Mukesh,
> >
> > There's been some great work on Spark Streaming reliability lately.
> > https://www.youtube.com/watch?v=jcJq3ZalXD8
> > Look at the links from:
> > https://issues.apache.org/jira/browse/SPARK-3129
> >
> > I'm not aware of any doc yet (did I miss something?) but you can look at
> > the ReliableKafkaReceiver's test suite:
> >
> >
> external/kafka/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala
> >
> > --
> > FG
> >
> >
> > On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha <me.mukesh....@gmail.com>
> > wrote:
> >>
> >> Hello Guys,
> >>
> >> Any insights on this??
> >> If I'm not clear enough, my question is: how can I use the Kafka
> >> consumer and not lose any data in case of failures with spark-streaming.
> >>
> >> On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha <me.mukesh....@gmail.com>
> >> wrote:
> >>>
> >>> Hello Experts,
> >>>
> >>> I'm working on a Spark app which reads data from Kafka & persists it in
> >>> HBase.
> >>>
> >>> Spark documentation states the below [1]: in case of worker failure we
> >>> can lose some data. If so, how can I make my Kafka stream more reliable?
> >>> I have seen there is a simple consumer [2] but I'm not sure if it has
> >>> been used/tested extensively.
> >>>
> >>> I was wondering if there is a way to explicitly acknowledge the kafka
> >>> offsets once they are replicated in memory of other worker nodes (if
> it's
> >>> not already done) to tackle this issue.
> >>>
> >>> Any help is appreciated in advance.
> >>>
> >>>
> >>> Using any input source that receives data through a network - For
> >>> network-based data sources like Kafka and Flume, the received input data
> >>> is replicated in memory between nodes of the cluster (default replication
> >>> factor is 2). So if a worker node fails, then the system can recompute
> >>> the lost data from the left over copy of the input data. However, if the
> >>> worker node where a network receiver was running fails, then a tiny bit
> >>> of data may be lost, that is, the data received by the system but not yet
> >>> replicated to other node(s). The receiver will be started on a different
> >>> node and it will continue to receive data.
> >>> https://github.com/dibbhatt/kafka-spark-consumer
> >>>
> >>> Txz,
> >>>
> >>> Mukesh Jha
> >>
> >>
> >>
> >>
> >> --
> >>
> >>
> >> Thanks & Regards,
> >>
> >> Mukesh Jha
> >
> >
>
>
>
>
> --
>
>
>
>
>
> Thanks & Regards,
>
> *Mukesh Jha <me.mukesh....@gmail.com>*
>


-- 


Thanks & Regards,

*Mukesh Jha <me.mukesh....@gmail.com>*
