Hi,

Please give it a try by changing the worker memory so that worker memory > executor
memory.
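
In standalone mode the worker's memory limit is normally set via SPARK_WORKER_MEMORY
in conf/spark-env.sh on each worker node. A minimal sketch, assuming that file is used;
the 14g figure is only an example chosen to leave headroom on a 16 GB machine, not a
recommendation:

    # conf/spark-env.sh on every worker node (example value)
    export SPARK_WORKER_MEMORY=14g   # should exceed spark.executor.memory (9G below)

The workers need to be restarted for the new limit to take effect. Exit code 137
(signal 9) in the quoted log is consistent with the executor process being killed
when memory runs out.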
 
Thanks & Regards, 
Meethu M


On Friday, 22 August 2014 5:18 PM, Yadid Ayzenberg <ya...@media.mit.edu> wrote:
 


Hi all,

I have a Spark cluster of 30 machines, 16 GB / 8 cores each, running in
standalone mode. Previously my application was working well (several
RDDs, the largest being around 50 GB).
When I started processing larger amounts of data (RDDs of 100 GB), my app
is losing executors. I'm currently just loading them from a database,
repartitioning, and persisting to disk (with 2x replication).
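
As a point of reference, a minimal Scala sketch of that load/repartition/persist
step, assuming an existing SparkContext sc and a hypothetical loadFromDatabase
helper (the partition count is also an assumption):

    import org.apache.spark.storage.StorageLevel

    // loadFromDatabase is a hypothetical stand-in for the real database read.
    val rdd = loadFromDatabase(sc)
      .repartition(400)                     // assumed partition count
      .persist(StorageLevel.DISK_ONLY_2)    // store on disk only, replicated on 2 nodes
    rdd.count()                             // forces the RDD to be materialized and persisted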
I have spark.executor.memory = 9G, memoryFraction = 0.5,
spark.worker.timeout = 120, spark.akka.askTimeout = 30, and
spark.storage.blockManagerHeartBeatMs = 30000.
I haven't changed the default worker memory, so it's at 512m (should
this be larger?).
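
In code, those settings correspond roughly to the sketch below; "memoryFraction"
is assumed to mean spark.storage.memoryFraction, and the values simply mirror the
ones listed above:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("large-rdd-job")                           // app name is an assumption
      .set("spark.executor.memory", "9g")
      .set("spark.storage.memoryFraction", "0.5")            // assumed meaning of "memoryFraction"
      .set("spark.worker.timeout", "120")
      .set("spark.akka.askTimeout", "30")
      .set("spark.storage.blockManagerHeartBeatMs", "30000")
    // Note: the standalone worker's own memory cap (SPARK_WORKER_MEMORY) is set in
    // conf/spark-env.sh, not through SparkConf.
    val sc = new SparkContext(conf)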

I've been getting the following messages from my app:

[error] o.a.s.s.TaskSchedulerImpl - Lost executor 3 on myserver1: worker lost
[error] o.a.s.s.TaskSchedulerImpl - Lost executor 13 on myserver2: Unknown executor exit code (137) (died from signal 9?)
[error] a.r.EndpointWriter - AssociationError [akka.tcp://spark@master:59406] -> [akka.tcp://sparkExecutor@myserver2:32955]: Error [Association failed with [akka.tcp://sparkExecutor@myserver2:32955]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@myserver2.com:32955]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: myserver2/198.18.102.160:32955
]
[error] a.r.EndpointWriter - AssociationError [akka.tcp://spark@master:59406] -> [akka.tcp://spark@myserver1:53855]: Error [Association failed with [akka.tcp://spark@myserver1:53855]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@myserver1:53855]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: myserver1/198.18.102.160:53855
]

The worker logs and executor logs do not contain errors. Any ideas what
the problem is?

Yadid

