Hi, I am using Spark 1.0.1 on YARN 2.5, and doing everything through the Spark shell.
I am running a job that essentially reads a bunch of HBase keys, looks up the corresponding HBase data, and performs some filtering and aggregation. The job works fine on smaller datasets, but when I try to execute it on the full dataset, it never completes. The symptoms I notice are:

a. The job shows progress for a while and then starts throwing lots of the following errors:

2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor 906 disconnected, so removing it*

2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost executor 906 on <machine name>: remote Akka client disassociated*

2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN org.apache.spark.storage.BlockManagerMasterActor - *Removing BlockManager BlockManagerId(9186, <machine name>, 54600, 0) with no recent heart beats: 82313ms exceeds 45000ms*

Looking at the logs, the job never recovers from these errors; it just keeps reporting lost executors and launching new ones, and this continues for a long time. Could this be because the executors are running out of memory? In terms of memory usage, the intermediate data could be large (after the HBase lookup), but the partially and fully aggregated data sets should be quite small - essentially a bunch of ids and counts (< 1 million in total).

b. In the Spark UI, I am seeing the following errors (redacted for brevity); I am not sure whether they are transient or a real issue:

java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out) ...
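In case the launch configuration matters, here is a sketch of the kind of spark-shell invocation I am using (the memory and core values below are placeholders, not my actual settings):

```shell
# Launch the Spark shell against YARN in client mode (Spark 1.0.x style).
# Executor count/cores/memory here are placeholder values, not my real ones.
spark-shell \
  --master yarn-client \
  --num-executors 20 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g
```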
org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
...
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)

I was trying to gather more data to investigate, but I haven't been able to figure out how to enable logging on the executors. The Spark UI appears stuck, and I only see driver-side logs in the jobhistory directory specified for the job.

Thanks,
pala
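P.S. One avenue I considered for getting at executor logs is pulling the aggregated YARN container logs after the application finishes (this assumes log aggregation is enabled on the cluster, which I have not verified):

```shell
# Fetch all container logs (driver + executors) for a finished YARN application.
# Requires yarn.log-aggregation-enable=true on the cluster.
# <application id> is a placeholder for the real id from the RM UI.
yarn logs -applicationId <application id>
```

but since the job never completes, that still doesn't let me see the executor side while it is running.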