Hi, I am using Spark 1.0.1 on YARN 2.5, and doing everything through the Spark shell.
I am running a job that essentially reads a bunch of HBase keys, looks up the corresponding HBase data, and performs some filtering and aggregation. The job works fine on smaller datasets, but when I try to execute it on the full dataset, it never completes. The symptoms I notice are:

a. The job shows progress for a while and then starts throwing lots of the following errors:

2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor 906 disconnected, so removing it*

2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost executor 906 on <machine name>: remote Akka client disassociated*

2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN org.apache.spark.storage.BlockManagerMasterActor - *Removing BlockManager BlockManagerId(9186, <machine name>, 54600, 0) with no recent heart beats: 82313ms exceeds 45000ms*

Looking at the logs, the job never recovers from these errors; it just keeps reporting lost executors and launching new ones, and this continues for a long time. Could this be because the executors are running out of memory? In terms of memory usage, the intermediate data could be large (after the HBase lookup), but the partially and fully aggregated data sets should be quite small - essentially a bunch of ids and counts (< 1 million in total).

b. In the Spark UI, I am seeing the following errors (redacted for brevity); I am not sure whether they are transient or a real issue:

java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out) ...
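In case the launch configuration matters, here is a sketch of the kind of spark-shell invocation I am using (the memory and core values below are placeholders, not my actual settings):

```shell
# Launch the Spark shell against YARN in client mode (Spark 1.0.x style).
# Executor count/cores/memory here are placeholder values, not my real ones.
spark-shell \
  --master yarn-client \
  --num-executors 20 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g
```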
org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
...
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)

I was trying to gather more data to investigate, but I haven't been able to figure out how to enable logging on the executors. The Spark UI appears stuck, and I only see driver-side logs in the jobhistory directory specified for the job.

Thanks,
pala
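P.S. One avenue I considered for getting at executor logs is pulling the aggregated YARN container logs after the application finishes (this assumes log aggregation is enabled on the cluster, which I have not verified):

```shell
# Fetch all container logs (driver + executors) for a finished YARN application.
# Requires yarn.log-aggregation-enable=true on the cluster.
# <application id> is a placeholder for the real id from the RM UI.
yarn logs -applicationId <application id>
```

but since the job never completes, that still doesn't let me see the executor side while it is running.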