[jira] [Updated] (SPARK-19528) external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

KaiXu updated SPARK-19528:
--------------------------
    Attachment: SPARK-19528.1.spark2.patch

> external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled
>
>                 Key: SPARK-19528
>                 URL: https://issues.apache.org/jira/browse/SPARK-19528
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Shuffle, Spark Core
>    Affects Versions: 1.6.2, 1.6.3, 2.0.2
>         Environment: Hadoop 2.7.1
>                      Spark 1.6.2
>                      Hive 2.2
>            Reporter: KaiXu
>            Priority: Major
>         Attachments: SPARK-19528.1.patch, SPARK-19528.1.spark2.patch
>
> When dynamic allocation is enabled, the external shuffle service maintains shuffle state on behalf of executors. The external shuffle service therefore should not close the connection before the executor does while the executor still has requests outstanding.
>
> container's log:
> {noformat}
> 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.1.1:41867
> 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
> 17/02/09 08:30:46 INFO executor.Executor: Starting executor ID 75 on host hsx-node8
> 17/02/09 08:30:46 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40374.
> 17/02/09 08:30:46 INFO netty.NettyBlockTransferService: Server created on 40374
> 17/02/09 08:30:46 INFO storage.BlockManager: external shuffle service port = 7337
> 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Registered BlockManager
> 17/02/09 08:30:46 INFO storage.BlockManager: Registering executor with local external shuffle service.
> 17/02/09 08:30:51 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from hsx-node8/192.168.1.8:7337 is closed
> 17/02/09 08:30:51 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
>         at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>         at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>         at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:144)
>         at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>         at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:215)
>         at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:201)
>         at org.apache.spark.executor.Executor.<init>(Executor.scala:86)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
>         at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
>         at org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
>         at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:274)
>         ... 14 more
> 17/02/09 08:31:01 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 1 more times after waiting 5 seconds...
> {noformat}
> nodemanager's log:
> {noformat}
> 2017-02-09 08:30:48,836 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1486564603520_0097_01_05]
> 2017-02-09 08:31:12,122 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1486564603520_0096_01_71 is : 1
> 2017-02-09 08:31:12,122
> {noformat}
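The failure mode visible in the container log is a fixed, short registration timeout combined with a small retry budget: each registration attempt is a synchronous RPC that gives up after a fixed timeout, and the BlockManager retries only a couple of times with a 5-second sleep in between ("will retry 2 more times after waiting 5 seconds..."). A minimal sketch of that retry loop follows; the class and method names are illustrative stand-ins, not Spark's actual code, and only the 3-attempt/fixed-wait shape is taken from the log:

```java
import java.util.concurrent.TimeoutException;

public class RegistrationRetrySketch {
    // "will retry 2 more times" in the log implies 3 attempts in total.
    static final int MAX_ATTEMPTS = 3;

    // Hypothetical stand-in for ExternalShuffleClient.registerWithShuffleServer.
    interface ShuffleRegistration {
        void register() throws TimeoutException;
    }

    // Try to register, sleeping waitMs between failed attempts; rethrow the
    // last TimeoutException once the retry budget is exhausted.
    static void registerWithRetries(ShuffleRegistration reg, long waitMs) throws Exception {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                reg.register();  // synchronous RPC bounded by a fixed timeout
                return;          // registered successfully
            } catch (TimeoutException e) {
                if (attempt == MAX_ATTEMPTS) {
                    throw e;     // out of retries: the executor fails to start
                }
                System.err.println("Failed to connect to external shuffle server, will retry "
                        + (MAX_ATTEMPTS - attempt) + " more times...");
                Thread.sleep(waitMs);
            }
        }
    }
}
```

Under a heavy workload the shuffle service can stay busy longer than the whole retry budget allows, which is why a short hard-coded timeout is the problem the attached patches target.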
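The TimeoutException at the root of the stack trace comes from sendRpcSync, which is essentially an asynchronous send followed by a bounded wait on the reply future, with the timeout rewrapped as a RuntimeException (matching "java.lang.RuntimeException: java.util.concurrent.TimeoutException" in the log). A rough sketch of that pattern, using illustrative names rather than the real TransportClient internals:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SendRpcSyncSketch {
    // A synchronous RPC is an async send plus a bounded wait for the reply.
    // If the shuffle service is too loaded to answer within timeoutMs, the
    // TimeoutException is wrapped in a RuntimeException, as seen in the log.
    static byte[] sendRpcSync(CompletableFuture<byte[]> replyFuture, long timeoutMs) {
        try {
            return replyFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}
```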
[jira] [Updated] (SPARK-19528) external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

KaiXu updated SPARK-19528:
--------------------------
    Affects Version/s: 2.0.2
[jira] [Updated] (SPARK-19528) external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-19528:
------------------------------
    Target Version/s:   (was: 1.6.2, 1.6.3)
[jira] [Updated] (SPARK-19528) external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

KaiXu updated SPARK-19528:
--------------------------
    Target Version/s: 1.6.3, 1.6.2
             Summary: external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled  (was: external shuffle service would close while still have request from executor when dynamic allocation is enabled)