Hi Cody,

That is clear. Thanks!

Bill
On Tue, May 19, 2015 at 1:27 PM, Cody Koeninger <c...@koeninger.org> wrote:

> If you checkpoint, the job will start from the successfully consumed
> offsets. If you don't checkpoint, by default it will start from the
> highest available offset, and you will potentially lose data.
>
> Is the link I posted, or for that matter the scaladoc, really not clear on
> that point?
>
> The scaladoc says:
>
> To recover from driver failures, you have to enable checkpointing in the
> StreamingContext
> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/StreamingContext.html>.
> The information on consumed offsets can be recovered from the checkpoint.
>
> On Tue, May 19, 2015 at 2:38 PM, Bill Jay <bill.jaypeter...@gmail.com>
> wrote:
>
>> If a Spark streaming job stops at 12:01 and I resume the job at 12:02,
>> will it still consume the data that were produced to Kafka at 12:01? Or
>> will it just start consuming from the current time?
>>
>> On Tue, May 19, 2015 at 10:58 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> Have you read
>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>> ?
>>>
>>> 1. There's nothing preventing that.
>>>
>>> 2. Checkpointing will give you at-least-once semantics, provided you
>>> have sufficient Kafka retention. Be aware that checkpoints aren't
>>> recoverable if you upgrade code.
>>>
>>> On Tue, May 19, 2015 at 12:42 PM, Bill Jay <bill.jaypeter...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am currently using Spark Streaming to consume and save logs every
>>>> hour in our production pipeline. The current setup is a crontab job
>>>> that checks every minute whether the job is still running and, if not,
>>>> resubmits the Spark Streaming job. I am using the direct approach for
>>>> the Kafka consumer. I have two questions:
>>>>
>>>> 1. In the direct approach, no offset is stored in ZooKeeper and no
>>>> group id is specified. Can two consumers (one a Spark Streaming job and
>>>> the other a Kafka console consumer from the Kafka package) read from the
>>>> same topic on the brokers together (I would like both of them to get all
>>>> messages, i.e. publish-subscribe mode)? What about two Spark Streaming
>>>> jobs reading from the same topic?
>>>>
>>>> 2. How can I avoid data loss if a Spark job is killed? Does
>>>> checkpointing serve this purpose? The default behavior of Spark Streaming
>>>> is to read the latest logs. However, if a job is killed, can the new job
>>>> resume from where it left off to avoid losing logs?
>>>>
>>>> Thanks!
>>>>
>>>> Bill
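
---

A minimal sketch of the checkpoint-based recovery Cody describes above, using the
direct Kafka stream (Scala, Spark 1.x streaming API; the checkpoint directory,
broker list, topic name, and output path below are placeholder values, not taken
from the thread):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object CheckpointedDirectStream {
      // Placeholder settings; replace with your own.
      val checkpointDir = "hdfs:///tmp/spark-checkpoints/log-consumer"
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
      val topics = Set("logs")

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("log-consumer")
        val ssc = new StreamingContext(conf, Seconds(60))
        // Enables metadata checkpointing, which includes the consumed offsets.
        ssc.checkpoint(checkpointDir)

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        // Save each batch. On recovery the stream resumes from the checkpointed
        // offsets, so data produced while the job was down is still consumed
        // (at-least-once, given sufficient Kafka retention).
        stream.map(_._2).foreachRDD { rdd =>
          rdd.saveAsTextFile(s"hdfs:///tmp/logs/${System.currentTimeMillis}")
        }
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On a fresh start this calls createContext(); after a restart it rebuilds
        // the context (and offsets) from the checkpoint directory instead.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

As noted above, such a checkpoint is not recoverable if you upgrade the application
code; in that case you would need to track the offsets yourself and supply them
explicitly when starting the new job.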