Sandy,

Good point - I forgot about the NM logs.
When I looked at the NM logs, I only see the following statements, which align with the driver-side log about lost executors. Many executors show the same log statement at the same time, so it seems the decision to kill many, if not all, executors was made centrally, and all executors were somehow notified:

14/11/18 00:18:25 INFO Executor: Executor is trying to kill task 2013
14/11/18 00:18:25 INFO Executor: Executor killed task 2013

In general, I also see quite a few instances of the following exception across many executors/nodes:

14/11/17 23:58:00 INFO HadoopRDD: Input split: <hdfs dir path>/sorted_keys-1020_3-r-00255.deflate:0+415841
14/11/17 23:58:00 WARN BlockReaderLocal: error creating DomainSocket
java.net.ConnectException: connect(2) error: Connection refused when trying to connect to '/srv/var/hadoop/runs/hdfs/dn_socket'
        at org.apache.hadoop.net.unix.DomainSocket.connect0(Native Method)
        at org.apache.hadoop.net.unix.DomainSocket.connect(DomainSocket.java:250)
        at org.apache.hadoop.hdfs.DomainSocketFactory.createSocket(DomainSocketFactory.java:158)
        at org.apache.hadoop.hdfs.BlockReaderFactory.nextDomainPeer(BlockReaderFactory.java:721)
        at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:441)
        at org.apache.hadoop.hdfs.client.ShortCircuitCache.create(ShortCircuitCache.java:780)
        at org.apache.hadoop.hdfs.client.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:714)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:395)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:303)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:790)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:201)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:184)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
        at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
        at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
        at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:50)
        at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
        at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
14/11/17 23:58:00 WARN ShortCircuitCache: ShortCircuitCache(0x71a8053d): failed to load 1276010498_BP-1416824317-172.22.48.2-1387241776581

However, on some of the nodes execution seems to have proceeded after the error, so the above could just be a transient issue.
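As a sanity check on that DomainSocket warning, something like the following can confirm whether the DataNode's short-circuit-read socket actually exists on a given worker node (the path is taken from the warning above; the helper name check_socket is just illustrative):

```shell
# Illustrative helper: report whether a Unix domain socket exists at the
# given path. [ -S path ] is the POSIX test for a socket file.
check_socket() {
  if [ -S "$1" ]; then
    echo "present: $1"
  else
    echo "missing: $1 (HDFS client falls back to reading the block over TCP)"
  fi
}

# Path from the BlockReaderLocal warning; run this on the affected node.
check_socket /srv/var/hadoop/runs/hdfs/dn_socket
```

If the socket is missing, short-circuit reads quietly fall back to TCP, which matches the "transient" behavior - the read still succeeds, just more slowly.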
Finally, in the driver logs, I looked for a hint about the decision to kill many executors around the 00:18:25 timestamp, when many tasks were killed across many executors, but I didn't find anything out of the ordinary.

On Tue, Nov 18, 2014 at 1:59 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hi Pala,
>
> Do you have access to your YARN NodeManager logs? Are you able to check
> whether they report killing any containers for exceeding memory limits?
>
> -Sandy
>
> On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
>
>> Hi,
>>
>> I am using Spark 1.0.1 on YARN 2.5, and doing everything through the Spark shell.
>>
>> I am running a job that essentially reads a bunch of HBase keys, looks up
>> HBase data, and performs some filtering and aggregation. The job works fine
>> on smaller datasets, but when I try to execute on the full dataset, the job
>> never completes. The symptoms I notice are:
>>
>> a. The job shows progress for a while and then starts throwing lots of
>> the following errors:
>>
>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
>> 906 disconnected, so removing it*
>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
>> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
>> executor 906 on <machine name>: remote Akka client disassociated*
>>
>> 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
>> org.apache.spark.storage.BlockManagerMasterActor - *Removing
>> BlockManager BlockManagerId(9186, <machine name>, 54600, 0) with no recent
>> heart beats: 82313ms exceeds 45000ms*
>>
>> Looking at the logs, the job never recovers from these errors; it
>> continues to report lost executors and launch new ones, and this goes on
>> for a long time.
>>
>> Could this be because the executors are running out of memory?
>>
>> In terms of memory usage, the intermediate data could be large (after the
>> HBase lookup), but the partial and fully aggregated data sets should be
>> quite small - essentially a bunch of ids and counts (< 1 million in total).
>>
>> b. In the Spark UI, I am seeing the following errors (redacted for
>> brevity); not sure if they are transient or a real issue:
>>
>> java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out)
>> ...
>> org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> ...
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:724)
>>
>> I was trying to get more data to investigate, but haven't been able to
>> figure out how to enable logging on the executors. The Spark UI appears
>> stuck, and I only see driver-side logs in the job history directory
>> specified in the job.
>>
>> Thanks,
>> pala
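P.S. If the NodeManager logs do show containers being killed for exceeding memory limits, one knob worth knowing about is the per-executor memory overhead on YARN. A sketch of the setting (property name as documented for Spark on YARN in the 1.x line; worth checking whether 1.0.1 honors it, since early releases used a fixed overhead):

```
# spark-defaults.conf (value in MB; 1024 is an illustrative figure, not a recommendation)
spark.yarn.executor.memoryOverhead   1024
```

YARN kills a container when its total physical memory (heap plus off-heap) exceeds the requested size, so raising this headroom is the usual first response to "running beyond physical memory limits" kills.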