Hi all,

I have a Spark cluster of 30 machines, each with 16GB of memory and 8 cores, running in standalone mode. Previously my application was working well (several RDDs, the largest being around 50G). When I started processing larger amounts of data (RDDs of 100G), my app started losing executors. I'm currently just loading the RDDs from a database, repartitioning, and persisting to disk (with replication x2). I have spark.executor.memory=9G, memoryFraction=0.5, spark.worker.timeout=120, spark.akka.askTimeout=30, spark.storage.blockManagerHeartBeatMs=30000. I haven't changed the default worker memory, so it's at 512m (should this be larger?).

I've been getting the following messages from my app:

[error] o.a.s.s.TaskSchedulerImpl - Lost executor 3 on myserver1: worker lost
[error] o.a.s.s.TaskSchedulerImpl - Lost executor 13 on myserver2: Unknown executor exit code (137) (died from signal 9?)
[error] a.r.EndpointWriter - AssociationError [akka.tcp://spark@master:59406] -> [akka.tcp://sparkExecutor@myserver2:32955]: Error [Association failed with [akka.tcp://sparkExecutor@myserver2:32955]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@myserver2.com:32955]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: myserver2/198.18.102.160:32955
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://spark@master:59406] -> [akka.tcp://spark@myserver1:53855]: Error [Association failed with [akka.tcp://spark@myserver1:53855]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@myserver1:53855]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: myserver1/198.18.102.160:53855
]
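
As a side note on the second message: the "(died from signal 9?)" hint matches the usual Unix convention that a process killed by signal N reports exit code 128 + N, so 137 would correspond to signal 9. A minimal Python sketch of that arithmetic follows; the helper name decode_exit_code is made up for illustration and is not part of Spark, and the signal name lookup assumes a POSIX system such as Linux.

```python
import signal


def decode_exit_code(code: int) -> str:
    """Describe an exit code, assuming the 128 + N convention for signals."""
    if code > 128:
        signum = code - 128  # e.g. 137 - 128 = 9
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {code}"


if __name__ == "__main__":
    # The exit code reported for executor 13 in the log above.
    print(decode_exit_code(137))
```

This only decodes the number; it says nothing about which process sent the signal or why.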

The worker logs and executor logs do not contain any errors. Any ideas what the problem might be?

Yadid
