Hi Pala,

Do you have access to your YARN NodeManager logs?  Are you able to check
whether they report killing any containers for exceeding memory limits?
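
If it helps, here is a minimal Python sketch for scanning NodeManager log
lines for memory-limit kills. It assumes the NodeManager logs the usual
"is running beyond physical memory limits" (or "virtual memory limits")
message when it kills a container; the exact wording may vary between Hadoop
versions. The function name `find_memory_kills` and the sample line are
hypothetical, for illustration only.

```python
import re
from typing import Dict, Iterable, List

# Assumption: the NodeManager reports a kill with a message of the form
# "Container [pid=...,containerID=container_...] is running beyond physical
# memory limits. ... Killing container." Adjust the pattern if your Hadoop
# version words it differently.
KILL_PATTERN = re.compile(
    r"(?P<container>container_\w+).*?"
    r"is running beyond (?P<kind>physical|virtual) memory limits"
)


def find_memory_kills(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one record per log line that reports a memory-limit kill."""
    kills = []
    for line in lines:
        match = KILL_PATTERN.search(line)
        if match:
            kills.append(
                {"container": match.group("container"), "kind": match.group("kind")}
            )
    return kills


if __name__ == "__main__":
    # Made-up sample line shaped like the message described above.
    sample = [
        "2014-11-18 00:18:19,900 WARN ContainersMonitorImpl: Container "
        "[pid=1234,containerID=container_1416300000000_0001_01_000906] "
        "is running beyond physical memory limits. Killing container.",
        "2014-11-18 00:18:20,000 INFO some unrelated log line",
    ]
    for kill in find_memory_kills(sample):
        print(kill["container"], "exceeded", kill["kind"], "memory limit")
```

To run it against a real log, pass an open file object to
`find_memory_kills`. Any hits would point toward the executors exceeding their
container memory allocation.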

-Sandy

On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:

> Hi,
>
> I am using Spark 1.0.1 on YARN 2.5, and doing everything through the Spark
> shell.
>
> I am running a job that essentially reads a bunch of HBase keys, looks up
> HBase data, and performs some filtering and aggregation. The job works fine
> on smaller datasets, but when I try to execute it on the full dataset, the
> job never completes. The symptoms I notice are:
>
> a. The job shows progress for a while and then starts throwing lots of the
> following errors:
>
> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
>  org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
> 906 disconnected, so removing it*
> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
> executor 906 on <machine name>: remote Akka client disassociated*
>
> 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
>  org.apache.spark.storage.BlockManagerMasterActor - *Removing
> BlockManager BlockManagerId(9186, <machine name>, 54600, 0) with no recent
> heart beats: 82313ms exceeds 45000ms*
>
> Looking at the logs, the job never recovers from these errors. It keeps
> reporting lost executors and launching new ones, and this continues for a
> long time.
>
> Could this be because the executors are running out of memory?
>
> In terms of memory usage, the intermediate data could be large (after the
> HBase lookup), but partial and fully aggregated data set size should be
> quite small - essentially a bunch of ids and counts (< 1 mil in total).
>
>
>
> b. In the Spark UI, I am seeing the following errors (truncated for
> brevity); I am not sure whether they are transient or a real issue:
>
> java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out)
> ...
> org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> ...
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:724)
>
>
>
>
> I was trying to get more data to investigate but haven't been able to
> figure out how to enable logging on the executors. The Spark UI appears
> stuck and I only see driver-side logs in the jobhistory directory specified
> in the job.
>
>
> Thanks,
> pala
>
>
>
