Hi Pala,

Do you have access to your YARN NodeManager logs? Are you able to check whether they report killing any containers for exceeding memory limits?
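If it helps, when the NodeManager kills a container for exceeding its limit, it logs a line containing "is running beyond physical memory limits". A sketch of the check (the log path and the sample log line below are illustrative; the actual log directory varies by install):

```shell
# NodeManager log line to look for (wording from YARN 2.x's container monitor):
#   ... is running beyond physical memory limits. ... Killing container.
# Log path varies by install; /var/log/hadoop-yarn is a common default:
#   grep "beyond physical memory limits" /var/log/hadoop-yarn/*nodemanager*.log*

# Self-contained demo of the check against a sample (illustrative) log line:
sample=$(mktemp)
echo "2014-11-18 00:18:19 WARN ContainersMonitorImpl: Container container_1416_0001_01_000907 is running beyond physical memory limits. Killing container." > "$sample"
grep -c "beyond physical memory limits" "$sample"   # prints the number of kill lines found (1 here)
rm -f "$sample"
```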
-Sandy

On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:

> Hi,
>
> I am using Spark 1.0.1 on YARN 2.5, and doing everything through spark-shell.
>
> I am running a job that essentially reads a bunch of HBase keys, looks up
> HBase data, and performs some filtering and aggregation. The job works fine
> on smaller datasets, but when I try to execute it on the full dataset, the
> job never completes. The symptoms I notice are:
>
> a. The job shows progress for a while and then starts throwing lots of the
> following errors:
>
> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Executor
> 906 disconnected, so removing it
> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost
> executor 906 on <machine name>: remote Akka client disassociated
>
> 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
> org.apache.spark.storage.BlockManagerMasterActor - Removing
> BlockManager BlockManagerId(9186, <machine name>, 54600, 0) with no recent
> heart beats: 82313ms exceeds 45000ms
>
> Looking at the logs, the job never recovers from these errors; it just
> keeps reporting lost executors and launching new ones, and this continues
> for a long time.
>
> Could this be because the executors are running out of memory?
>
> In terms of memory usage, the intermediate data could be large (after the
> HBase lookup), but the partially and fully aggregated data sets should be
> quite small -- essentially a bunch of ids and counts (< 1 million in total).
>
> b. In the Spark UI, I am seeing the following errors (redacted for
> brevity); I am not sure whether they are transient or a real issue:
>
> java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out)
> ...
> org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> ...
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:724)
>
> I was trying to get more data to investigate but haven't been able to
> figure out how to enable logging on the executors. The Spark UI appears
> stuck, and I only see driver-side logs in the job history directory
> specified in the job.
>
> Thanks,
> pala
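Re: the executor-logging question in the quoted mail: on YARN, executor stdout/stderr go to the YARN container logs rather than the driver's log directory; after the application finishes (and if log aggregation is enabled), `yarn logs -applicationId <app id>` retrieves them. To get more detail out of the executors, one common approach is to ship a custom log4j.properties alongside the job; a minimal sketch (the DEBUG targets below are my guess at what is useful for lost-executor debugging, not a confirmed recipe):

```properties
# Minimal log4j.properties for executors: send everything to the container's
# stderr, which ends up in the YARN container logs.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Extra detail on the components involved in lost-executor / heartbeat issues
log4j.logger.org.apache.spark.storage=DEBUG
log4j.logger.org.apache.spark.executor=DEBUG
```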