hey gwen,

no immediate plans to contribute it to spark, but of course we are open to
this. given spark's pull request backlog my suspicion is that the spark
community prefers a user library at this point.

if you lose a node the task will restart. and since each task reads until
the end of a kafka partition, which is somewhat of a moving target, the
resulting data will not be the same (it will include whatever was written
to the partition in the meantime). i am not sure there is an elegant way
to implement this so that the data is the same upon a task restart.

if you need the data read to be the same upon retry, this can be done with
a transformation on the rdd in some cases. for example, if you need data
exactly up to midnight, you can include a timestamp in the data, start the
KafkaRDD sometime just after midnight, and then filter out any data with a
timestamp after midnight. the filtered rdd will then be the same even if
there is a node failure.
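
as a rough sketch (names here are illustrative, not the spark-kafka api;
i'm assuming each record carries its own event timestamp in epoch millis):

import org.apache.spark.rdd.RDD

// hypothetical record type; assumes the producer embeds an event timestamp
case class Event(ts: Long, payload: String)

// kafkaRdd stands in for whatever rdd you read from kafka. the filter is
// deterministic given the cutoff, so even if a retried task reads extra
// records written after the original attempt, the filtered rdd comes out
// the same.
def upToMidnight(kafkaRdd: RDD[Event], midnightMillis: Long): RDD[Event] =
  kafkaRdd.filter(_.ts < midnightMillis)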

On Sun, Dec 14, 2014 at 8:27 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
>
> Thank you, this is really cool! Are you planning on contributing this to
> Spark?
>
> Another question: what's the behavior if I lose a node while my Spark
> App is running? Will the RDD recovery process get the exact same data
> from Kafka as the original? Even if we wrote additional data to Kafka
> in the meantime?
>
> Gwen
>
> On Sun, Dec 14, 2014 at 5:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > hello all,
> > we at tresata wrote a library to provide for batch integration between
> > spark and kafka. it supports:
> > * distributed write of rdd to kafka
> > * distributed read of rdd from kafka
> >
> > our main use cases are (in lambda architecture speak):
> > * periodic appends to the immutable master dataset on hdfs from kafka
> > using spark
> > * make non-streaming data available in kafka with periodic data drops
> > from hdfs using spark. this is to facilitate merging the speed and
> > batch layers in spark-streaming
> > * distributed writes from spark-streaming
> >
> > see here:
> > https://github.com/tresata/spark-kafka
> >
> > best,
> > koert
>