Re: Lost leader exception in Kafka Direct for Streaming

Adrian Tanase Thu, 01 Oct 2015 07:44:29 -0700

This also happened to me in extreme recovery scenarios – e.g. Killing 4 out of 
a  7 machine cluster.

I’d put my money on recovering from an out of sync replica, although I haven’t 
done extensive testing around it.

-adrian

From: Cody Koeninger
Date: Thursday, October 1, 2015 at 5:18 PM
To: swetha
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Lost leader exception in Kafka Direct for Streaming

Did you check you kafka broker logs to see what was going on during that time?

The direct stream will handle normal leader loss / rebalance by retrying tasks.

But the exception you got indicates that something with kafka was wrong, such 
that offsets were being re-used.

ie. your job already processed up through beginning offset 15027734702

but when asking kafka for the highest available offsets, it returns ending 
offset 15027725493

which is lower, in other words kafka lost messages.  This might happen because 
you lost a leader and recovered from a replica that wasn't in sync, or someone 
manually screwed up a topic, or ... ?

If you really want to just blindly "recover" from this situation (even though 
something is probably wrong with your data), the most straightforward thing to 
do is monitor and restart your job.

On Wed, Sep 30, 2015 at 4:31 PM, swetha 
<swethakasire...@gmail.com<mailto:swethakasire...@gmail.com>> wrote:

Hi,

I see this sometimes in our Kafka Direct approach in our Streaming job. How
do we make sure that the job recovers from such errors and works normally
thereafter?

15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition
19,  sleeping for 200ms
15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition
5,  sleeping for 200ms

Followed by every task failing with something like this:

15/09/30 05:26:20 ERROR Executor: Exception in task 4.0 in stage 84281.0
(TID 818804)
kafka.common.NotLeaderForPartitionException

And:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
in stage 84958.0 failed 4 times, most recent failure: Lost task 15.3 in
stage 84958.0 (TID 819461, 10.227.68.102): java.lang.AssertionError:
assertion failed: Beginning offset 15027734702 is after the ending offset
15027725493 for topic hubble_stream partition 12. You either provided an
invalid fromOffset, or the Kafka topic has been damaged

Thanks,
Swetha

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-leader-exception-in-Kafka-Direct-for-Streaming-tp24891.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>
For additional commands, e-mail: 
user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>

Re: Lost leader exception in Kafka Direct for Streaming

Reply via email to