Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
Hi Cody,

That is clear. Thanks!

Bill

On Tue, May 19, 2015 at 1:27 PM, Cody Koeninger c...@koeninger.org wrote:

 If you checkpoint, the job will start from the successfully consumed
 offsets.  If you don't checkpoint, by default it will start from the
 highest available offset, and you will potentially lose data.

 Is the link I posted, or for that matter the scaladoc, really not clear on
 that point?

 The scaladoc says:

  To recover from driver failures, you have to enable checkpointing in the
 StreamingContext
 http://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/StreamingContext.html.
 The information on consumed offset can be recovered from the checkpoint.
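 For reference, a minimal sketch of what that looks like with the direct
 stream (Spark 1.3+): enable checkpointing on the StreamingContext and
 rebuild the context with getOrCreate on restart. The broker list, topic
 name, and checkpoint directory below are placeholders, not values from
 this thread.

     import kafka.serializer.StringDecoder
     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.kafka.KafkaUtils

     object CheckpointedDirectStream {
       // hypothetical checkpoint location
       val checkpointDir = "hdfs:///tmp/spark-kafka-checkpoint"

       def createContext(): StreamingContext = {
         val conf = new SparkConf().setAppName("kafka-direct-checkpoint")
         val ssc = new StreamingContext(conf, Seconds(60))
         ssc.checkpoint(checkpointDir)  // enable checkpointing

         val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
         val topics = Set("logs")       // hypothetical topic

         val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
           ssc, kafkaParams, topics)

         stream.foreachRDD { rdd =>
           // save the hourly logs here
         }
         ssc
       }

       def main(args: Array[String]): Unit = {
         // On a clean start this calls createContext; after a failure it
         // restores the context, including the consumed offsets, from the
         // checkpoint directory.
         val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
         ssc.start()
         ssc.awaitTermination()
       }
     }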

 On Tue, May 19, 2015 at 2:38 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 If a Spark streaming job stops at 12:01 and I resume it at 12:02, will it
 still consume the data that were produced to Kafka at 12:01, or will it
 just start consuming from the current time?


 On Tue, May 19, 2015 at 10:58 AM, Cody Koeninger c...@koeninger.org
 wrote:

 Have you read
 https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
 ?

 1.  There's nothing preventing that.

  2. Checkpointing will give you at-least-once semantics, provided you
  have sufficient Kafka retention.  Be aware that checkpoints aren't
  recoverable if you upgrade the code.
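  As a rough sketch of the alternative the blogpost describes for surviving
  code upgrades: pull the offset ranges off each batch with HasOffsetRanges
  and persist them in your own store, then feed them back into
  createDirectStream's fromOffsets argument on the next deployment. The sink
  and offset store below are placeholders.

     import org.apache.spark.streaming.dstream.InputDStream
     import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

     object ManualOffsetTracking {
       def saveWithOffsets(stream: InputDStream[(String, String)]): Unit = {
         stream.foreachRDD { rdd =>
           // the cast only works on the RDD produced directly by
           // createDirectStream, before any transformation
           val offsetRanges: Array[OffsetRange] =
             rdd.asInstanceOf[HasOffsetRanges].offsetRanges

           rdd.foreachPartition { records =>
             // write the records to your sink here
           }

           offsetRanges.foreach { o =>
             // persist (o.topic, o.partition, o.fromOffset, o.untilOffset) to
             // your own store; on redeploy, read them back and pass them as
             // the fromOffsets argument to createDirectStream
           }
         }
       }
     }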

 On Tue, May 19, 2015 at 12:42 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

  I am currently using Spark Streaming to consume and save logs every
  hour in our production pipeline. The current setup runs a crontab job
  every minute to check whether the streaming job is still running and, if
  not, to resubmit it. I am using the direct approach for the Kafka
  consumer. I have two questions:

  1. In the direct approach, no offset is stored in ZooKeeper and no
  group id is specified. Can two consumers (one a Spark Streaming job and
  the other the Kafka console consumer shipped with Kafka) read from the
  same topic on the same brokers, with both of them receiving all
  messages, i.e. publish-subscribe mode? What about two Spark Streaming
  jobs reading from the same topic?

  2. How can I avoid data loss if a Spark job is killed? Does checkpointing
  serve this purpose? By default, Spark Streaming starts reading from the
  latest offsets. However, if a job is killed, can the new job resume from
  where the old one left off to avoid losing logs?

 Thanks!

 Bill






