Hi, I believe this is some kind of timeout problem, but I can't figure out how to increase it. I am running Spark 1.2.0 on YARN (all from CDH 5.3.0). I submit a Python task which first loads a big RDD from HBase. I can see in the screen output that all the executors fire up, then there is no more logging output for the next two minutes, after which I get plenty of:

15/01/16 17:35:16 ERROR cluster.YarnClientClusterScheduler: Lost executor 7 on node01: remote Akka client disassociated
15/01/16 17:35:16 INFO scheduler.TaskSetManager: Re-queueing tasks for 7 from TaskSet 1.0
15/01/16 17:35:16 WARN scheduler.TaskSetManager: Lost task 32.0 in stage 1.0 (TID 17, node01): ExecutorLostFailure (executor 7 lost)
15/01/16 17:35:16 WARN scheduler.TaskSetManager: Lost task 34.0 in stage 1.0 (TID 25, node01): ExecutorLostFailure (executor 7 lost)

This points to some timeout of ~120 secs hitting while the nodes are loading the big RDD? Any ideas how to get around it?

FYI, I already use the following options, without any success:

spark.core.connection.ack.wait.timeout: 600
spark.akka.timeout: 1000
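For reference, a minimal sketch of how I am setting those two options in the PySpark driver (the app name is just a placeholder, and the HBase-loading code is omitted; the same values could presumably also be passed with --conf on spark-submit):

    from pyspark import SparkConf, SparkContext

    # the two timeouts I already tried raising, without success
    # (ack.wait.timeout is in seconds; akka.timeout is in seconds as well)
    conf = (SparkConf()
            .setAppName("hbase-load")  # placeholder name
            .set("spark.core.connection.ack.wait.timeout", "600")
            .set("spark.akka.timeout", "1000"))
    sc = SparkContext(conf=conf)

    # ... load the big RDD from HBase here ...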
Thanks,
Antony.