[ https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves resolved SPARK-12864.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> Fetch failure from AM restart
> -----------------------------
>
>                 Key: SPARK-12864
>                 URL: https://issues.apache.org/jira/browse/SPARK-12864
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.1, 1.4.1, 1.5.2
>            Reporter: iward
>             Fix For: 2.0.0
>
>
> Currently, when the number of executor failures reaches
> *maxNumExecutorFailures*, the *ApplicationMaster* is killed and a new one
> registers. The new *ApplicationMaster* creates a new *YarnAllocator* instance,
> so the *executorIdCounter* property in *YarnAllocator* is reset to *0* and the
> IDs of new executors start from 1 again. These IDs collide with those of
> executors created before the restart, which causes a FetchFailedException.
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor on *BJHC-HERA-16217.hadoop.jd.local*
> has the same executor ID (1) as the one on *BJHC-HERA-17030.hadoop.jd.local*.
> The two executors are therefore confused with each other, which causes the
> FetchFailedException.
> *This executor ID conflict occurs only in yarn-client mode, because the
> driver does not run on YARN and so outlives the restarted ApplicationMaster.*
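
The collision described in this issue can be reproduced with a small toy model. The sketch below is not Spark code: the classes Allocator and Driver and the function simulate are hypothetical stand-ins for YarnAllocator, the yarn-client driver and an AM restart, and the host names are shortened from the log. With seed_from_driver=False the second allocator starts counting from 0 again and reuses executor ID 1. With seed_from_driver=True the counter is seeded from the highest ID the driver already knows, which is one possible way to avoid the clash and is assumed here for illustration only, not taken from the actual patch.

```python
from typing import Dict, Optional


class Allocator:
    """Toy stand-in for YarnAllocator: hands out increasing executor IDs."""

    def __init__(self, start_after: int = 0) -> None:
        # A fresh instance starts its counter at 0 unless told otherwise.
        self.executor_id_counter = start_after

    def next_executor_id(self) -> int:
        self.executor_id_counter += 1
        return self.executor_id_counter


class Driver:
    """Toy stand-in for the yarn-client driver, assumed to survive an AM restart."""

    def __init__(self) -> None:
        self.executors: Dict[int, str] = {}  # executor ID -> host

    def register(self, executor_id: int, host: str) -> Optional[str]:
        """Record an executor; return the host that already held this ID, if any."""
        previous = self.executors.get(executor_id)
        self.executors[executor_id] = host
        return previous

    def last_allocated_id(self) -> int:
        return max(self.executors, default=0)


def simulate(seed_from_driver: bool) -> Optional[str]:
    """Return the host whose executor ID was reused after the restart, or None."""
    driver = Driver()

    # First ApplicationMaster: its allocator assigns ID 1.
    first_am = Allocator()
    driver.register(first_am.next_executor_id(), "BJHC-HERA-17030")

    # AM restart: a new allocator instance is created while the driver keeps running.
    start = driver.last_allocated_id() if seed_from_driver else 0
    second_am = Allocator(start_after=start)
    return driver.register(second_am.next_executor_id(), "BJHC-HERA-16217")


if __name__ == "__main__":
    print("counter reset to 0  ->", simulate(seed_from_driver=False))
    print("seeded from driver  ->", simulate(seed_from_driver=True))
```

In the first case the driver ends up with two hosts claiming ID 1, which mirrors the log: the fetch is addressed to BlockManagerId(1, BJHC-HERA-17030...) while ID 1 has just been registered for BJHC-HERA-16217.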