[ 
https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105671#comment-15105671
 ] 

Saisai Shao edited comment on SPARK-12864 at 1/18/16 7:03 PM:
--------------------------------------------------------------

What Spark version are you using? I remember I fixed a similar AM re-attempt 
issue before. As I recall, when the AM exits, all the related 
containers/executors exit as well, and when another attempt starts, it 
initializes the AM from scratch, so there should be no problem.

Did you mean that when the AM fails, all the containers are still running in 
your cluster? If so, that's kind of weird; could you please elaborate on what 
you saw? Thanks a lot.


>  initialize executorIdCounter after ApplicationMaster killed for max number 
> of executor failures reached
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12864
>                 URL: https://issues.apache.org/jira/browse/SPARK-12864
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.1, 1.4.1, 1.5.2
>            Reporter: iward
>
> Currently, when the number of executor failures reaches 
> *maxNumExecutorFailures*, the *ApplicationMaster* is killed and a new one is 
> registered. The new attempt creates a new *YarnAllocator* instance, but the 
> *executorIdCounter* property in *YarnAllocator* resets to *0*, so the IDs of 
> new executors start from 1 again. These IDs collide with the executors 
> created by the previous attempt, which causes FetchFailedException.
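> A minimal sketch of the problem (names like SimpleYarnAllocator and 
> newExecutorId are simplified stand-ins; the real counter is the 
> executorIdCounter field of org.apache.spark.deploy.yarn.YarnAllocator):
> {noformat}
> // Simplified stand-in for YarnAllocator: the counter is an instance field,
> // so every new AM attempt starts handing out IDs from 1 again.
> class SimpleYarnAllocator {
>   private var executorIdCounter = 0
>   def newExecutorId(): String = {
>     executorIdCounter += 1
>     executorIdCounter.toString
>   }
> }
>
> object Demo extends App {
>   val firstAttempt = new SimpleYarnAllocator
>   println(firstAttempt.newExecutorId())  // "1" -> driver registers executor 1
>
>   // AM killed after maxNumExecutorFailures; YARN starts a second attempt,
>   // which builds a fresh allocator while the driver (client mode) lives on.
>   val secondAttempt = new SimpleYarnAllocator
>   println(secondAttempt.newExecutorId()) // "1" again -> clashes with old ID
> }
> {noformat}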
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has 
> disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: 
> Registered executor: 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793])
>  with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): 
> FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337
> ), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: 
> /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor on *BJHC-HERA-16217.hadoop.jd.local* was 
> registered with the same ID (1) as the earlier executor on 
> *BJHC-HERA-17030.hadoop.jd.local*. This ID collision is what causes the 
> FetchFailedException.
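> One way to avoid the collision (a sketch only, not an actual patch; all names 
> here are hypothetical) is to seed *executorIdCounter* from state kept on the 
> driver, which survives AM re-attempts in yarn-client mode:
> {noformat}
> // Hypothetical driver-side bookkeeping that outlives AM attempts.
> object DriverBookkeeping {
>   @volatile var lastAllocatedExecutorId: Int = 0
> }
>
> class RecoveringYarnAllocator {
>   // Seed the counter from the driver instead of hard-coding 0.
>   private var executorIdCounter: Int =
>     DriverBookkeeping.lastAllocatedExecutorId
>
>   def newExecutorId(): String = {
>     executorIdCounter += 1
>     DriverBookkeeping.lastAllocatedExecutorId = executorIdCounter
>     executorIdCounter.toString
>   }
> }
>
> object FixDemo extends App {
>   val firstAttempt = new RecoveringYarnAllocator
>   println(firstAttempt.newExecutorId())  // "1"
>
>   // A second AM attempt continues where the previous one stopped.
>   val secondAttempt = new RecoveringYarnAllocator
>   println(secondAttempt.newExecutorId()) // "2" -> no collision
> }
> {noformat}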


