Hi,

We have a cluster running Spark 1.0.2 with 4 workers and 1 master, each with 64G of RAM. In the SparkContext we specify 32G of executor memory. However, whenever a task runs longer than approximately 15 minutes, all the executors are lost, as if some sort of timeout were being hit, regardless of whether the task is using up the memory. We tried increasing spark.akka.timeout, spark.akka.lookupTimeout, and spark.worker.timeout, but still no luck. Moreover, even if we just start a SparkContext and let it sit idle without calling "stop" on it, it still errors out with the exception below:
[error] o.a.s.s.TaskSchedulerImpl - Lost executor 0 on XXX06: remote Akka client disassociated
[error] o.a.s.n.ConnectionManager - Corresponding SendingConnection to ConnectionManagerId(XXX06.local,34307) not found
[error] o.a.s.s.TaskSchedulerImpl - Lost executor 2 on XXX08: remote Akka client disassociated
[error] o.a.s.s.TaskSchedulerImpl - Lost executor 1 on XXX07: remote Akka client disassociated
[error] o.a.s.n.SendingConnection - Exception while reading SendingConnection to ConnectionManagerId(XXX1,56639)
java.nio.channels.ClosedChannelException: null
    at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) ~[na:1.7.0_60]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) ~[na:1.7.0_60]
    at org.apache.spark.network.SendingConnection.read(Connection.scala:390) ~[spark-core_2.10-1.1.0.jar:1.1.0]
    at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199) [spark-core_2.10-1.1.0.jar:1.1.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60]
[error] o.a.s.n.SendingConnection - Exception while reading SendingConnection to ConnectionManagerId(XXX08.local,39914)
java.nio.channels.ClosedChannelException: null
    at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) ~[na:1.7.0_60]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) ~[na:1.7.0_60]
    at org.apache.spark.network.SendingConnection.read(Connection.scala:390) ~[spark-core_2.10-1.1.0.jar:1.1.0]
    at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199) [spark-core_2.10-1.1.0.jar:1.1.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX06:60653]: Error [Association failed with [akka.tcp://sparkExecutor@XXX06:60653]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX06:60653]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX06/10.40.31.51:60653
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX06:61000]: Error [Association failed with [akka.tcp://sparkExecutor@XXX06:61000]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX06:61000]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX06/10.40.31.51:61000
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX08:52949]: Error [Association failed with [akka.tcp://sparkExecutor@XXX08:52949]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX08:52949]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX08/10.40.31.53:52949
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX08:36726]: Error [Association failed with [akka.tcp://sparkExecutor@XXX08:36726]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX08:36726]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX08/10.40.31.53:36726
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX07:46516]: Error [Association failed with [akka.tcp://sparkExecutor@XXX07:46516]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX07:46516]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX07/10.40.31.52:46516
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX07:48160]: Error [Association failed with [akka.tcp://sparkExecutor@XXX07:48160]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX07:48160]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX07/10.40.31.52:48160
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX06:60653]: Error [Association failed with [akka.tcp://sparkExecutor@XXX06:60653]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX06:60653]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX06/10.40.31.51:60653
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX06:61000]: Error [Association failed with [akka.tcp://sparkExecutor@XXX06:61000]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX06:61000]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX06/10.40.31.51:61000
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX07:46516]: Error [Association failed with [akka.tcp://sparkExecutor@XXX07:46516]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX07:46516]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX07/10.40.31.52:46516
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX08:52949]: Error [Association failed with [akka.tcp://sparkExecutor@XXX08:52949]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX08:52949]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX08/10.40.31.53:52949
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX08:36726]: Error [Association failed with [akka.tcp://sparkExecutor@XXX08:36726]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX08:36726]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX08/10.40.31.53:36726
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX07:48160]: Error [Association failed with [akka.tcp://sparkExecutor@XXX07:48160]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX07:48160]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX07/10.40.31.52:48160
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX06:61000]: Error [Association failed with [akka.tcp://sparkExecutor@XXX06:61000]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX06:61000]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX06/10.40.31.51:61000
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX06:60653]: Error [Association failed with [akka.tcp://sparkExecutor@XXX06:60653]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX06:60653]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX06/10.40.31.51:60653
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX07:46516]: Error [Association failed with [akka.tcp://sparkExecutor@XXX07:46516]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX07:46516]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX07/10.40.31.52:46516
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX08:52949]: Error [Association failed with [akka.tcp://sparkExecutor@XXX08:52949]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX08:52949]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX08/10.40.31.53:52949
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX07:48160]: Error [Association failed with [akka.tcp://sparkExecutor@XXX07:48160]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX07:48160]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX07/10.40.31.52:48160
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://sparkDriver@losbornev.local:35540] -> [akka.tcp://sparkExecutor@XXX08:36726]: Error [Association failed with [akka.tcp://sparkExecutor@XXX08:36726]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@XXX08:36726]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: XXX08/10.40.31.53:36726
]

Thanks in advance!
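
For reference, here is a minimal sketch of how the settings mentioned at the top of this message could be applied when creating the SparkContext. The application name, master URL, and timeout values are placeholders for illustration only, not our actual values. spark.worker.timeout is left out because, as far as I understand, it is read by the standalone master rather than by the application.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TimeoutConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("timeout-config-sketch")     // hypothetical application name
      .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
      .set("spark.executor.memory", "32g")     // 32G executor memory, as described above
      .set("spark.akka.timeout", "300")        // seconds; placeholder value
      .set("spark.akka.lookupTimeout", "300")  // seconds; placeholder value

    val sc = new SparkContext(conf)
    try {
      // Print the configuration the driver actually ended up with,
      // to confirm that the settings were picked up.
      println(sc.getConf.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Printing the effective configuration at startup is a quick way to rule out the settings being silently ignored or overridden before looking at the network layer.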