[ https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106129#comment-15106129 ]
iward commented on SPARK-12864:
-------------------------------

Yeah, you are right. When the AM exits, all executors exit with it, but the *BlockManagerMaster* in the driver still stores the block information computed by the executors that have already exited, and shuffle reads for that data are resolved by executor ID. If a new AM is then re-registered, the *executorIdCounter* resets to 0. So when a new executor with executor ID `12` tries to shuffle data that was computed by an exited executor whose ID was also `12`, the new executor will not fetch the data from a remote host: it treats the blocks as local and, as my task log shows, the files are not found (No such file or directory). My Spark versions are 1.3.1 and 1.5.2.

> initialize executorIdCounter after ApplicationMaster killed for max number
> of executor failures reached
> --------------------------------------------------------------------------
>
>                 Key: SPARK-12864
>                 URL: https://issues.apache.org/jira/browse/SPARK-12864
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.1, 1.4.1, 1.5.2
>            Reporter: iward
>
> Currently, when the number of executor failures reaches *maxNumExecutorFailures*, the *ApplicationMaster* is killed and a new one is registered. A new *YarnAllocator* instance is created along with it, but the *executorIdCounter* in that *YarnAllocator* resets to *0*, so the IDs of new executors start from 1 again. These IDs can collide with those of executors created before, which causes a FetchFailedException.
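The scenario above can be simulated outside Spark. The sketch below uses hypothetical names (`block_locations`, `register_executor`) standing in for the BlockManagerMaster's shuffle map and YarnAllocator's executorIdCounter; it is an illustration of the race, not Spark code:

```python
# Minimal simulation (hypothetical names, not Spark code) of the ID collision:
# the driver's block-location map survives an AM restart, but the executor ID
# counter does not, so a new executor can be assigned a dead executor's ID.

block_locations = {}       # stands in for BlockManagerMaster's shuffle map
executor_id_counter = 0    # stands in for YarnAllocator.executorIdCounter

def register_executor():
    global executor_id_counter
    executor_id_counter += 1
    return str(executor_id_counter)

# First AM: executor "1" computes a shuffle block.
old_id = register_executor()
block_locations["shuffle_5_2_0"] = old_id

# AM is killed and re-registered: a fresh YarnAllocator resets the counter...
executor_id_counter = 0
# ...but the driver-side block map is NOT cleared.

new_id = register_executor()   # the new executor also gets ID "1"
assert new_id == old_id        # ID collision

# The new executor asks where the block lives; the stale entry points at its
# own ID, so it reads locally and hits "No such file or directory".
fetch_is_local = (block_locations["shuffle_5_2_0"] == new_id)
print(fetch_is_local)          # True, even though the file is gone
```

The key point is the last line: because the stale location matches the reused ID, the fetch is classified as local and never goes to the remote shuffle service.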
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO 	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor ID registered on *BJHC-HERA-16217.hadoop.jd.local* is the same as the one previously registered on *BJHC-HERA-17030.hadoop.jd.local*, which is confusing and causes the FetchFailedException.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
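The fix the issue title proposes, initializing *executorIdCounter* after the ApplicationMaster is replaced, amounts to seeding the counter from the executor IDs the driver already knows about, so freshly allocated IDs can never collide with old ones. A hedged sketch of that idea (the function name and inputs are hypothetical, not Spark's actual API):

```python
# Hypothetical sketch of the proposed fix: after an AM restart, start the
# counter above the largest executor ID already known to the driver, so new
# executor IDs can never collide with IDs handed out before the restart.

def init_executor_id_counter(known_executor_ids):
    """Return the starting counter value for a restarted allocator."""
    return max((int(i) for i in known_executor_ids), default=0)

# The driver still remembers executors from before the AM died.
counter = init_executor_id_counter(["3", "7", "12"])
next_id = counter + 1
print(next_id)   # 13 -- no collision with the dead executor "12"
```

With an empty set of known executors (a truly fresh application) the counter starts at 0 and IDs begin at 1, matching the existing behavior.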