Re: Lost leader exception in Kafka Direct for Streaming

2015-10-21 Thread Cody Koeninger
You can try running the driver under the cluster manager with --supervise, but
that's basically the same thing as restarting it yourself when it fails.

There is no reasonable automatic "recovery" when something is fundamentally
wrong with your kafka cluster.
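
If it helps, here's roughly what that looks like against the standalone
cluster manager (the master url, class name and jar below are placeholders):

  spark-submit --master spark://master-host:7077 \
    --deploy-mode cluster \
    --supervise \
    --class com.example.StreamingJob \
    streaming-job.jar

Keep in mind --supervise only restarts the driver when it dies; it does
nothing about whatever is wrong on the kafka side.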

On Wed, Oct 21, 2015 at 12:46 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:

> Hi Cody,
>
> What options do I have other than monitoring and restarting the job?
> Can the job recover automatically?
>
> Thanks,
> Sweth
>
> On Thu, Oct 1, 2015 at 7:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Did you check your kafka broker logs to see what was going on during that
>> time?
>>
>> The direct stream will handle normal leader loss / rebalance by retrying
>> tasks.
>>
>> But the exception you got indicates that something with kafka was wrong,
>> such that offsets were being re-used.
>>
>> i.e., your job already processed up through beginning offset 15027734702
>>
>> but when asking kafka for the highest available offsets, it returns
>> ending offset 15027725493
>>
>> which is lower; in other words, kafka lost messages.  This might happen
>> because you lost a leader and recovered from a replica that wasn't in sync,
>> or someone manually screwed up a topic, or ... ?
>>
>> If you really want to just blindly "recover" from this situation (even
>> though something is probably wrong with your data), the most
>> straightforward thing to do is monitor and restart your job.
>>
>>
>>
>>
>> On Wed, Sep 30, 2015 at 4:31 PM, swetha <swethakasire...@gmail.com>
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> I sometimes see this with the Kafka Direct approach in our Streaming job.
>>> How do we make sure that the job recovers from such errors and works
>>> normally thereafter?
>>>
>>> 15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 19,  sleeping for 200ms
>>> 15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 5,  sleeping for 200ms
>>>
>>> Followed by every task failing with something like this:
>>>
>>> 15/09/30 05:26:20 ERROR Executor: Exception in task 4.0 in stage 84281.0
>>> (TID 818804)
>>> kafka.common.NotLeaderForPartitionException
>>>
>>> And:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
>>> in stage 84958.0 failed 4 times, most recent failure: Lost task 15.3 in
>>> stage 84958.0 (TID 819461, 10.227.68.102): java.lang.AssertionError:
>>> assertion failed: Beginning offset 15027734702 is after the ending offset
>>> 15027725493 for topic hubble_stream partition 12. You either provided an
>>> invalid fromOffset, or the Kafka topic has been damaged
>>>
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Lost leader exception in Kafka Direct for Streaming

2015-10-20 Thread swetha kasireddy
Hi Cody,

What options do I have other than monitoring and restarting the job?
Can the job recover automatically?

Thanks,
Sweth

On Thu, Oct 1, 2015 at 7:18 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Did you check your kafka broker logs to see what was going on during that
> time?
>
> The direct stream will handle normal leader loss / rebalance by retrying
> tasks.
>
> But the exception you got indicates that something with kafka was wrong,
> such that offsets were being re-used.
>
> i.e., your job already processed up through beginning offset 15027734702
>
> but when asking kafka for the highest available offsets, it returns ending
> offset 15027725493
>
> which is lower; in other words, kafka lost messages.  This might happen
> because you lost a leader and recovered from a replica that wasn't in sync,
> or someone manually screwed up a topic, or ... ?
>
> If you really want to just blindly "recover" from this situation (even
> though something is probably wrong with your data), the most
> straightforward thing to do is monitor and restart your job.
>
>
>
>
> On Wed, Sep 30, 2015 at 4:31 PM, swetha <swethakasire...@gmail.com> wrote:
>
>>
>> Hi,
>>
>> I sometimes see this with the Kafka Direct approach in our Streaming job.
>> How do we make sure that the job recovers from such errors and works
>> normally thereafter?
>>
>> 15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 19,  sleeping for 200ms
>> 15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 5,  sleeping for 200ms
>>
>> Followed by every task failing with something like this:
>>
>> 15/09/30 05:26:20 ERROR Executor: Exception in task 4.0 in stage 84281.0
>> (TID 818804)
>> kafka.common.NotLeaderForPartitionException
>>
>> And:
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
>> in stage 84958.0 failed 4 times, most recent failure: Lost task 15.3 in
>> stage 84958.0 (TID 819461, 10.227.68.102): java.lang.AssertionError:
>> assertion failed: Beginning offset 15027734702 is after the ending offset
>> 15027725493 for topic hubble_stream partition 12. You either provided an
>> invalid fromOffset, or the Kafka topic has been damaged
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
>>
>>
>


Re: Lost leader exception in Kafka Direct for Streaming

2015-10-01 Thread Cody Koeninger
Did you check your kafka broker logs to see what was going on during that
time?

The direct stream will handle normal leader loss / rebalance by retrying
tasks.
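
If you want to give the job more room to ride out a leader election, the
relevant knobs are worth knowing: the 200ms sleep in that log is kafka's
refresh.leader.backoff.ms default, and the "failed 4 times" is
spark.task.maxFailures. A sketch (the brokers and values are made up, tune
for your cluster):

  import org.apache.spark.SparkConf

  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> "broker1:9092,broker2:9092", // placeholder brokers
    // wait longer between leader re-fetch attempts during an election
    "refresh.leader.backoff.ms" -> "1000"
  )
  val conf = new SparkConf().set("spark.task.maxFailures", "8") // default is 4

That only helps with transient leader loss, though, not with the offset
assertion below.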

But the exception you got indicates that something with kafka was wrong,
such that offsets were being re-used.

i.e., your job already processed up through beginning offset 15027734702

but when asking kafka for the highest available offsets, it returns ending
offset 15027725493

which is lower; in other words, kafka lost messages.  This might happen
because you lost a leader and recovered from a replica that wasn't in sync,
or someone manually screwed up a topic, or ... ?

If you really want to just blindly "recover" from this situation (even
though something is probably wrong with your data), the most
straightforward thing to do is monitor and restart your job.
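
For the archives, here's a rough sketch of what that monitoring could look
like, against the kafka 0.8 SimpleConsumer API. The host, topic and offset
bookkeeping are placeholders; treat it as a starting point, not tested code:

  import kafka.api.{OffsetRequest, PartitionOffsetRequestInfo}
  import kafka.common.TopicAndPartition
  import kafka.consumer.SimpleConsumer

  // Ask one broker for the latest available offset of a partition.
  def latestOffset(host: String, port: Int, tp: TopicAndPartition): Long = {
    val consumer = new SimpleConsumer(host, port, 10000, 64 * 1024, "offset-check")
    try {
      val req = OffsetRequest(
        Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.LatestTime, 1)))
      consumer.getOffsetsBefore(req).partitionErrorAndOffsets(tp).offsets.head
    } finally {
      consumer.close()
    }
  }

  // If the offset you last processed is beyond what kafka now has, messages
  // were lost; the only way forward is to clamp and restart, accepting the loss.
  val tp = TopicAndPartition("x_stream", 19)          // placeholder partition
  val lastProcessed = 15027734702L                    // from your own offset store
  val latest = latestOffset("broker-host", 9092, tp)  // placeholder broker
  if (latest < lastProcessed) {
    // alert, then restart the stream with fromOffsets = Map(tp -> latest)
  }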




On Wed, Sep 30, 2015 at 4:31 PM, swetha <swethakasire...@gmail.com> wrote:

>
> Hi,
>
> I sometimes see this with the Kafka Direct approach in our Streaming job. How
> do we make sure that the job recovers from such errors and works normally
> thereafter?
>
> 15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 19,  sleeping for 200ms
> 15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 5,  sleeping for 200ms
>
> Followed by every task failing with something like this:
>
> 15/09/30 05:26:20 ERROR Executor: Exception in task 4.0 in stage 84281.0
> (TID 818804)
> kafka.common.NotLeaderForPartitionException
>
> And:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
> in stage 84958.0 failed 4 times, most recent failure: Lost task 15.3 in
> stage 84958.0 (TID 819461, 10.227.68.102): java.lang.AssertionError:
> assertion failed: Beginning offset 15027734702 is after the ending offset
> 15027725493 for topic hubble_stream partition 12. You either provided an
> invalid fromOffset, or the Kafka topic has been damaged
>
>
> Thanks,
> Swetha
>
>
>
>
>
>


Re: Lost leader exception in Kafka Direct for Streaming

2015-10-01 Thread Adrian Tanase
This also happened to me in extreme recovery scenarios – e.g. killing 4
machines out of a 7-machine cluster.

I’d put my money on recovering from an out-of-sync replica, although I haven’t
done extensive testing around it.
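
If that is the cause, it may be worth checking whether the brokers allow
unclean leader election; my understanding (not verified against this case) is
that disabling it trades availability for not losing acknowledged messages:

  # server.properties on each broker
  unclean.leader.election.enable=false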

-adrian

From: Cody Koeninger
Date: Thursday, October 1, 2015 at 5:18 PM
To: swetha
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Lost leader exception in Kafka Direct for Streaming

Did you check your kafka broker logs to see what was going on during that time?

The direct stream will handle normal leader loss / rebalance by retrying tasks.

But the exception you got indicates that something with kafka was wrong, such 
that offsets were being re-used.

i.e., your job already processed up through beginning offset 15027734702

but when asking kafka for the highest available offsets, it returns ending 
offset 15027725493

which is lower; in other words, kafka lost messages.  This might happen because
you lost a leader and recovered from a replica that wasn't in sync, or someone 
manually screwed up a topic, or ... ?

If you really want to just blindly "recover" from this situation (even though 
something is probably wrong with your data), the most straightforward thing to 
do is monitor and restart your job.




On Wed, Sep 30, 2015 at 4:31 PM, swetha <swethakasire...@gmail.com> wrote:

Hi,

I sometimes see this with the Kafka Direct approach in our Streaming job. How
do we make sure that the job recovers from such errors and works normally
thereafter?

15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 19,  sleeping for 200ms
15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 5,  sleeping for 200ms

Followed by every task failing with something like this:

15/09/30 05:26:20 ERROR Executor: Exception in task 4.0 in stage 84281.0
(TID 818804)
kafka.common.NotLeaderForPartitionException

And:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
in stage 84958.0 failed 4 times, most recent failure: Lost task 15.3 in
stage 84958.0 (TID 819461, 10.227.68.102): java.lang.AssertionError:
assertion failed: Beginning offset 15027734702 is after the ending offset
15027725493 for topic hubble_stream partition 12. You either provided an
invalid fromOffset, or the Kafka topic has been damaged


Thanks,
Swetha








Lost leader exception in Kafka Direct for Streaming

2015-09-30 Thread swetha

Hi,

I sometimes see this with the Kafka Direct approach in our Streaming job. How
do we make sure that the job recovers from such errors and works normally
thereafter?

15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 19,  sleeping for 200ms
15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 5,  sleeping for 200ms

Followed by every task failing with something like this:

15/09/30 05:26:20 ERROR Executor: Exception in task 4.0 in stage 84281.0
(TID 818804)
kafka.common.NotLeaderForPartitionException

And:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
in stage 84958.0 failed 4 times, most recent failure: Lost task 15.3 in
stage 84958.0 (TID 819461, 10.227.68.102): java.lang.AssertionError:
assertion failed: Beginning offset 15027734702 is after the ending offset
15027725493 for topic hubble_stream partition 12. You either provided an
invalid fromOffset, or the Kafka topic has been damaged


Thanks,
Swetha



