Hi,

We have a cluster running Spark 1.0.2 with 4 workers and 1 master, each
with 64G of RAM. In the SparkContext we specify 32G of executor memory.
However, whenever a task runs longer than approximately 15 minutes, all
the executors are lost, as if by some sort of timeout, regardless of whether
the task is using up the memory. We tried increasing spark.akka.timeout,
spark.akka.lookupTimeout, and spark.worker.timeout, but still no luck.
Moreover, even if we just start a SparkContext and leave it idle without
calling "stop" on it, it still errors out with the exceptions below:

[error] o.a.s.s.TaskSchedulerImpl - Lost executor 0 on XXX06: remote Akka
client disassociated
[error] o.a.s.n.ConnectionManager - Corresponding SendingConnection to
ConnectionManagerId(XXX06.local,34307) not found
[error] o.a.s.s.TaskSchedulerImpl - Lost executor 2 on XXX08: remote Akka
client disassociated
[error] o.a.s.s.TaskSchedulerImpl - Lost executor 1 on XXX07: remote Akka
client disassociated
[error] o.a.s.n.SendingConnection - Exception while reading
SendingConnection to ConnectionManagerId(XXX1,56639)
java.nio.channels.ClosedChannelException: null
at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
~[na:1.7.0_60]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
~[na:1.7.0_60]
at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
~[spark-core_2.10-1.1.0.jar:1.1.0]
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
[spark-core_2.10-1.1.0.jar:1.1.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_60]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_60]
[error] o.a.s.n.SendingConnection - Exception while reading
SendingConnection to ConnectionManagerId(XXX08.local,39914)
java.nio.channels.ClosedChannelException: null
at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
~[na:1.7.0_60]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
~[na:1.7.0_60]
at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
~[spark-core_2.10-1.1.0.jar:1.1.0]
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
[spark-core_2.10-1.1.0.jar:1.1.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_60]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_60]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX06:60653]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX06:60653]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX06:60653]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX06/10.40.31.51:60653
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX06:61000]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX06:61000]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX06:61000]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX06/10.40.31.51:61000
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX08:52949]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX08:52949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX08:52949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX08/10.40.31.53:52949
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX08:36726]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX08:36726]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX08:36726]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX08/10.40.31.53:36726
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX07:46516]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX07:46516]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX07:46516]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX07/10.40.31.52:46516
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX07:48160]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX07:48160]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX07:48160]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX07/10.40.31.52:48160
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX06:60653]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX06:60653]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX06:60653]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX06/10.40.31.51:60653
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX06:61000]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX06:61000]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX06:61000]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX06/10.40.31.51:61000
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX07:46516]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX07:46516]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX07:46516]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX07/10.40.31.52:46516
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX08:52949]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX08:52949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX08:52949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX08/10.40.31.53:52949
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX08:36726]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX08:36726]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX08:36726]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX08/10.40.31.53:36726
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX07:48160]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX07:48160]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX07:48160]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX07/10.40.31.52:48160
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX06:61000]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX06:61000]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX06:61000]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX06/10.40.31.51:61000
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX06:60653]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX06:60653]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX06:60653]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX06/10.40.31.51:60653
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX07:46516]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX07:46516]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX07:46516]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX07/10.40.31.52:46516
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX08:52949]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX08:52949]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX08:52949]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX08/10.40.31.53:52949
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX07:48160]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX07:48160]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX07:48160]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX07/10.40.31.52:48160
]
[error] a.r.EndpointWriter - AssociationError
[akka.tcp://sparkDriver@losbornev.local:35540] ->
[akka.tcp://sparkExecutor@XXX08:36726]: Error [Association failed with
[akka.tcp://sparkExecutor@XXX08:36726]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@XXX08:36726]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: XXX08/10.40.31.53:36726
]
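
For reference, here is a minimal sketch of the kind of setup described at the
top of this message. It is not our exact code: the master URL, the application
name, the object name, and all timeout values are placeholders chosen for
illustration. Only the property names and the 32G executor memory come from
the description above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical reproduction sketch; names and values are placeholders.
object TimeoutSettingsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // placeholder master URL
      .setAppName("timeout-test")            // placeholder application name
      .set("spark.executor.memory", "32g")   // 32G per executor, as described
      // The three timeouts we tried raising; the values are placeholders.
      .set("spark.akka.timeout", "300")
      .set("spark.akka.lookupTimeout", "300")
      .set("spark.worker.timeout", "300")

    val sc = new SparkContext(conf)

    // No job is submitted: the context is simply left open, which is the
    // idle case mentioned above. The 30-minute sleep is an arbitrary choice
    // that exceeds the roughly 15 minutes after which executors are lost.
    Thread.sleep(30 * 60 * 1000L)

    sc.stop()
  }
}
```

The same properties could presumably also be set through the cluster's
configuration files instead of in code; the sketch only shows the
programmatic route.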


Thanks in advance!
