[ https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105671#comment-15105671 ]
Saisai Shao commented on SPARK-12864:
-------------------------------------

Which Spark version are you using? I remember fixing a similar AM re-attempt issue before. As I recall, when the AM exits, all of its containers/executors exit as well, and when another attempt starts it initializes the AM from scratch, so there should be no problem. Do you mean that when the AM fails, all the containers are still running in your cluster? If so, that is rather odd; could you please elaborate on what you saw? Thanks a lot.

> initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12864
>                 URL: https://issues.apache.org/jira/browse/SPARK-12864
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.1, 1.4.1, 1.5.2
>            Reporter: iward
>
> Currently, when the number of executor failures reaches *maxNumExecutorFailures*, the *ApplicationMaster* is killed and a new one is registered. At that point a new *YarnAllocator* instance is created, but its *executorIdCounter* is reset to *0*, so the IDs of the new executors start from 1 again. These IDs collide with executors created before the restart, which causes FetchFailedException.
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor ID of *BJHC-HERA-16217.hadoop.jd.local* is the same as that of *BJHC-HERA-17030.hadoop.jd.local*, so the two executors are confused with each other, which causes the FetchFailedException.
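
For illustration only, here is a minimal, self-contained Scala sketch of the idea behind the requested fix, not the actual Spark patch: seed the new allocator's executor ID counter from the highest executor ID the driver has already seen, instead of restarting from 0 after an AM re-attempt. The names DriverState, SimpleAllocator and lastSeenExecutorId are hypothetical stand-ins for the driver-side executor registry and YarnAllocator.

{code:scala}
// Hypothetical sketch (not Spark code): why resetting executorIdCounter to 0
// after an AM restart reuses IDs, and how seeding it from the driver avoids that.
object ExecutorIdCounterSketch {

  // Stand-in for the driver-side record of executors that have registered so far.
  final class DriverState {
    @volatile var lastSeenExecutorId: Int = 0
    def registerExecutor(id: Int): Unit = synchronized {
      lastSeenExecutorId = math.max(lastSeenExecutorId, id)
    }
  }

  // Stand-in for YarnAllocator: each AM attempt creates a fresh instance.
  // The buggy behaviour corresponds to `private var executorIdCounter = 0`, so a
  // second attempt would hand out IDs 1, 2, ... again and collide with executors
  // created by the first attempt.
  final class SimpleAllocator(driver: DriverState) {
    private var executorIdCounter: Int = driver.lastSeenExecutorId

    def nextExecutorId(): Int = {
      executorIdCounter += 1
      driver.registerExecutor(executorIdCounter)
      executorIdCounter
    }
  }

  def main(args: Array[String]): Unit = {
    val driver = new DriverState

    val firstAttempt = new SimpleAllocator(driver)
    val beforeRestart = (1 to 3).map(_ => firstAttempt.nextExecutorId())
    println(s"first AM attempt allocated executor IDs:  $beforeRestart")  // 1, 2, 3

    // The AM is killed for too many executor failures and a new attempt starts:
    // a brand-new allocator is created, but it resumes from the driver's counter.
    val secondAttempt = new SimpleAllocator(driver)
    val afterRestart = (1 to 2).map(_ => secondAttempt.nextExecutorId())
    println(s"second AM attempt allocated executor IDs: $afterRestart")   // 4, 5 (no collision)
  }
}
{code}

With the buggy initialization, the second attempt would print 1 and 2 again, matching the duplicate executor ID 1 seen in the task log above.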