Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
Hi Cody,

That is clear. Thanks!

Bill

On Tue, May 19, 2015 at 1:27 PM, Cody Koeninger c...@koeninger.org wrote:

 If you checkpoint, the job will start from the successfully consumed
 offsets.  If you don't checkpoint, by default it will start from the
 highest available offset, and you will potentially lose data.

 Is the link I posted, or for that matter the scaladoc, really not clear on
 that point?

 The scaladoc says:

  To recover from driver failures, you have to enable checkpointing in the
 StreamingContext
 http://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/StreamingContext.html.
 The information on consumed offset can be recovered from the checkpoint.
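 For reference, a minimal sketch of what that looks like with the direct
 stream (Spark 1.3+): enable checkpointing on the StreamingContext and
 rebuild the context with getOrCreate on restart. The broker list, topic
 name, and checkpoint directory below are placeholders, not values from
 this thread.

     import kafka.serializer.StringDecoder
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.kafka.KafkaUtils

     object CheckpointedDirectStream {
       // hypothetical checkpoint location
       val checkpointDir = "hdfs:///tmp/spark-kafka-checkpoint"

       def createContext(): StreamingContext = {
         val conf = new SparkConf().setAppName("kafka-direct-checkpoint")
         val ssc = new StreamingContext(conf, Seconds(60))
         ssc.checkpoint(checkpointDir)  // enable checkpointing

         val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
         val topics = Set("logs")       // hypothetical topic

         val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
           ssc, kafkaParams, topics)

         stream.foreachRDD { rdd =>
           // save the hourly logs here
         }
         ssc
       }

       def main(args: Array[String]): Unit = {
         // On a clean start this calls createContext; after a failure it
         // restores the context, including the consumed offsets, from the
         // checkpoint directory.
         val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
         ssc.start()
         ssc.awaitTermination()
       }
     }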

 On Tue, May 19, 2015 at 2:38 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 If a Spark streaming job stops at 12:01 and I resume it at 12:02, will it
 still consume the data that were produced to Kafka at 12:01, or will it
 just start consuming from the current time?


 On Tue, May 19, 2015 at 10:58 AM, Cody Koeninger c...@koeninger.org
 wrote:

 Have you read
 https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
 ?

 1.  There's nothing preventing that.

  2. Checkpointing will give you at-least-once semantics, provided you
  have sufficient Kafka retention.  Be aware that checkpoints aren't
  recoverable if you upgrade the code.
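  As a rough sketch of the alternative the blogpost describes for surviving
  code upgrades: pull the offset ranges off each batch with HasOffsetRanges
  and persist them in your own store, then feed them back into
  createDirectStream's fromOffsets argument on the next deployment. The sink
  and offset store below are placeholders.

     import org.apache.spark.streaming.dstream.InputDStream
     import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

     object ManualOffsetTracking {
       def saveWithOffsets(stream: InputDStream[(String, String)]): Unit = {
         stream.foreachRDD { rdd =>
           // the cast only works on the RDD produced directly by
           // createDirectStream, before any transformation
           val offsetRanges: Array[OffsetRange] =
             rdd.asInstanceOf[HasOffsetRanges].offsetRanges

           rdd.foreachPartition { records =>
             // write the records to your sink here
           }

           offsetRanges.foreach { o =>
             // persist (o.topic, o.partition, o.fromOffset, o.untilOffset) to
             // your own store; on redeploy, read them back and pass them as
             // the fromOffsets argument to createDirectStream
           }
         }
       }
     }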

 On Tue, May 19, 2015 at 12:42 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

  I am currently using Spark Streaming to consume and save logs every
  hour in our production pipeline. The current setup runs a crontab job
  every minute to check whether the streaming job is still running and, if
  not, to resubmit it. I am using the direct approach for the Kafka
  consumer. I have two questions:

  1. In the direct approach, no offset is stored in ZooKeeper and no
  group id is specified. Can two consumers (one a Spark Streaming job and
  the other the Kafka console consumer shipped with Kafka) read from the
  same topic on the same brokers, with both of them receiving all
  messages, i.e. publish-subscribe mode? What about two Spark Streaming
  jobs reading from the same topic?

  2. How can I avoid data loss if a Spark job is killed? Does checkpointing
  serve this purpose? By default, Spark Streaming starts reading from the
  latest offsets. However, if a job is killed, can the new job resume from
  where the old one left off to avoid losing logs?

 Thanks!

 Bill






