[jira] [Commented] (SPARK-7827) kafka streaming NotLeaderForPartition Exception could not be handled normally

Cody Koeninger (JIRA) Fri, 22 May 2015 12:03:18 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-7827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556631#comment-14556631
 ]


Cody Koeninger commented on SPARK-7827:
---------------------------------------

As TD said, it looks like it did retry 4 times.

You can set refresh.leader.backoff.ms in kafka params to affect the sleep 
duration before throwing the exception. Kafka default is 200ms.  Can you tell 
how long Kafka took to rebalance?



> kafka streaming NotLeaderForPartition Exception could not be handled normally
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-7827
>                 URL: https://issues.apache.org/jira/browse/SPARK-7827
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.3.1
>         Environment: spark 1.3.1,  on yarn,   hadoop 2.6.0  
>            Reporter: hotdog
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> This is my Cluster 's log, once the partition's leader could not be found, 
> the total streaming task will fail...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 3549.0 failed 4 times, most recent failure: Lost task 11.3 in stage 
> 3549.0 (TID 385491, td-h85.cp): kafka.common.NotLeaderForPartitionException
>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>       at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>       at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>       at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>       at java.lang.Class.newInstance(Class.java:374)
>       at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:73)
>       at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:142)
>       at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:151)
>       at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:162)
>       at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>       at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
>       at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>       at org.apache.spark.scheduler.Task.run(Task.scala:64)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>       at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>       at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>       at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>       at scala.Option.foreach(Option.scala:236)
>       at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>       at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
>       at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
>       at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I read the source code of KafkaRDD.scala:
>   private def handleFetchErr(resp: FetchResponse) {
>       if (resp.hasError) {
>         val err = resp.errorCode(part.topic, part.partition)
>         if (err == ErrorMapping.LeaderNotAvailableCode ||
>           err == ErrorMapping.NotLeaderForPartitionCode) {
>           log.error(s"Lost leader for topic ${part.topic} partition 
> ${part.partition}, " +
>             s" sleeping for ${kc.config.refreshLeaderBackoffMs}ms")
>           Thread.sleep(kc.config.refreshLeaderBackoffMs)
>         }
>         // Let normal rdd retry sort out reconnect attempts
>         throw ErrorMapping.exceptionFor(err)
>       }
>     }
> it seems the code throw a NotLeaderForPartition exception and expect the 
> normal rdd retry.  but why the rdd retry was not performed and the job failed 
> suddenly



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-7827) kafka streaming NotLeaderForPartition Exception could not be handled normally

Reply via email to