[ 
https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864814#comment-15864814
 ] 

Shixiong Zhu commented on SPARK-19528:
--------------------------------------

This error occurs because the executor cannot connect to the external shuffle 
service. Did you check the logs of the external shuffle service to make sure 
this is not caused by a network issue?
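
As a quick first check before digging through the shuffle service logs, one could test from the executor's host whether the shuffle service port accepts TCP connections at all. The following is only a minimal sketch: the helper name `can_connect` is hypothetical, and the host `hsx-node8` and port `7337` in the usage comment are taken from the container log quoted below. A successful connection shows only that the port is reachable; it does not prove that the registration RPC will succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (values from the quoted log; adjust for your cluster):
# print(can_connect("hsx-node8", 7337))
```

If this returns False, a network or firewall problem, or a shuffle service that is not running, is the more likely cause. If it returns True, the shuffle service and NodeManager logs are the next place to look.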

> external shuffle service would close while still have request from executor 
> when dynamic allocation is enabled 
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19528
>                 URL: https://issues.apache.org/jira/browse/SPARK-19528
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Shuffle, Spark Core
>    Affects Versions: 1.6.2
>         Environment: Hadoop2.7.1
> spark1.6.2
> hive2.2
>            Reporter: KaiXu
>
> When dynamic allocation is enabled, the external shuffle service is used to 
> maintain unfinished state across executors. The external shuffle service 
> should therefore not shut down before the executor while it still has 
> outstanding requests from that executor.
> Container log:
> 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
> driver: spark://CoarseGrainedScheduler@192.168.1.1:41867
> 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Successfully 
> registered with driver
> 17/02/09 08:30:46 INFO executor.Executor: Starting executor ID 75 on host 
> hsx-node8
> 17/02/09 08:30:46 INFO util.Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40374.
> 17/02/09 08:30:46 INFO netty.NettyBlockTransferService: Server created on 
> 40374
> 17/02/09 08:30:46 INFO storage.BlockManager: external shuffle service port = 
> 7337
> 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Registered BlockManager
> 17/02/09 08:30:46 INFO storage.BlockManager: Registering executor with local 
> external shuffle service.
> 17/02/09 08:30:51 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from hsx-node8/192.168.1.8:7337 is closed
> 17/02/09 08:30:51 ERROR storage.BlockManager: Failed to connect to external 
> shuffle server, will retry 2 more times after waiting 5 seconds...
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>       at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>       at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>       at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:144)
>       at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
>       at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>       at 
> org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:215)
>       at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:201)
>       at org.apache.spark.executor.Executor.<init>(Executor.scala:86)
>       at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
>       at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>       at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>       at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>       at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
>       at 
> org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
>       at 
> org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
>       at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:274)
>       ... 14 more
> 17/02/09 08:31:01 ERROR storage.BlockManager: Failed to connect to external 
> shuffle server, will retry 1 more times after waiting 5 seconds...
> NodeManager log:
> 2017-02-09 08:30:48,836 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1486564603520_0097_01_000005]
> 2017-02-09 08:31:12,122 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_1486564603520_0096_01_000071 is : 1
> 2017-02-09 08:31:12,122 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception 
> from container-launch with container ID: 
> container_1486564603520_0096_01_000071 and exit code: 1
> ExitCodeException exitCode=1:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>         at org.apache.hadoop.util.Shell.run(Shell.java:456)
>         at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from 
> container-launch.
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: 
> container_1486564603520_0096_01_000071
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: 
> ExitCodeException exitCode=1:
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> org.apache.hadoop.util.Shell.run(Shell.java:456)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at 
> java.lang.Thread.run(Thread.java:745)
> 2017-02-09 08:31:12,122 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container exited with a non-zero exit code 1
> 2017-02-09 08:31:12,122 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1486564603520_0096_01_000071 transitioned from RUNNING 
> to EXITED_WITH_FAILURE


