[ https://issues.apache.org/jira/browse/SPARK-7827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556631#comment-14556631 ]
Cody Koeninger commented on SPARK-7827: --------------------------------------- As TD said, it looks like it did retry 4 times. You can set refresh.leader.backoff.ms in kafka params to affect the sleep duration before throwing the exception. Kafka default is 200ms. Can you tell how long Kafka took to rebalance? > kafka streaming NotLeaderForPartition Exception could not be handled normally > ----------------------------------------------------------------------------- > > Key: SPARK-7827 > URL: https://issues.apache.org/jira/browse/SPARK-7827 > Project: Spark > Issue Type: Bug > Components: Streaming > Affects Versions: 1.3.1 > Environment: spark 1.3.1, on yarn, hadoop 2.6.0 > Reporter: hotdog > Original Estimate: 120h > Remaining Estimate: 120h > > This is my Cluster 's log, once the partition's leader could not be found, > the total streaming task will fail... > org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in > stage 3549.0 failed 4 times, most recent failure: Lost task 11.3 in stage > 3549.0 (TID 385491, td-h85.cp): kafka.common.NotLeaderForPartitionException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at java.lang.Class.newInstance(Class.java:374) > at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:73) > at > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:142) > at > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:151) > at > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:162) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I read the source code of KafkaRDD.scala: > private def handleFetchErr(resp: FetchResponse) { > if (resp.hasError) { > val err = resp.errorCode(part.topic, part.partition) > if (err == ErrorMapping.LeaderNotAvailableCode || > err == ErrorMapping.NotLeaderForPartitionCode) { > log.error(s"Lost leader for topic ${part.topic} partition > ${part.partition}, " + > s" sleeping for ${kc.config.refreshLeaderBackoffMs}ms") > Thread.sleep(kc.config.refreshLeaderBackoffMs) > } > // Let normal rdd retry sort out reconnect attempts > throw ErrorMapping.exceptionFor(err) > } > } > it seems the code throw a NotLeaderForPartition exception and expect the > normal rdd retry. but why the rdd retry was not performed and the job failed > suddenly -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org