Re: How to troubleshoot MetadataFetchFailedException: Missing an output location for shuffle 0

2019-12-16 Thread Alessandro Solimando
Hi Warren,
It's often an exception stemming from an OOM at the executor level.

If you are caching data, make sure it can spill to disk if needed.

You could also try increasing off-heap memory to alleviate the issue.

Of course, handing more memory to the executors also helps.
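
For reference, a minimal PySpark sketch of those three knobs (the values and the
input path are placeholders, not recommendations; the spark.memory.offHeap.*
properties assume Spark 1.6 or later):

    from pyspark import SparkConf, SparkContext, StorageLevel

    conf = (SparkConf()
            .set("spark.executor.memory", "8g")           # more memory per executor
            .set("spark.memory.offHeap.enabled", "true")  # allow off-heap allocation
            .set("spark.memory.offHeap.size", "2g"))      # off-heap budget per executor
    sc = SparkContext(conf=conf)

    rdd = sc.textFile("hdfs:///path/to/input")  # hypothetical input path
    # MEMORY_AND_DISK lets cached partitions spill to disk instead of failing with OOM
    rdd.persist(StorageLevel.MEMORY_AND_DISK)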

Best regards,
Alessandro

On Mon, 16 Dec 2019 at 21:01, Warren Zhu wrote:

> Hi All,
>
> I have seen this exception many times in my production environment for
> long-running batch jobs. Is there a systematic summary of all root causes of
> this exception? Below is my analysis:
>
> 1. This happens when an executor tries to fetch the MapStatus of some shuffle.
> 2. Each executor maintains a local cache of all map statuses. When it can't
> find one in the local cache, the executor tries to fetch the latest from the
> driver, which acts as the MapOutputTrackerMaster.
> 3. The driver's map statuses are only cleared when the epoch gets updated.
> 4. The epoch gets updated when an executor is restarted, which might be caused
> by executor loss. I have confirmed that if a container (executor) is killed by
> YARN for exceeding memory limits, this exception happens.
>
> So I have 3 questions:
> 1. Is my analysis correct?
> 2. Are there other clues or causes that could result in this exception?
> 3. How can this exception be fixed?
>
> Thanks,
> Warren
>


How to troubleshoot MetadataFetchFailedException: Missing an output location for shuffle 0

2019-12-16 Thread Warren Zhu
Hi All,

I have seen this exception many times in my production environment for
long-running batch jobs. Is there a systematic summary of all root causes of
this exception? Below is my analysis:

1. This happens when an executor tries to fetch the MapStatus of some shuffle.
2. Each executor maintains a local cache of all map statuses. When it can't
find one in the local cache, the executor tries to fetch the latest from the
driver, which acts as the MapOutputTrackerMaster.
3. The driver's map statuses are only cleared when the epoch gets updated.
4. The epoch gets updated when an executor is restarted, which might be caused
by executor loss. I have confirmed that if a container (executor) is killed by
YARN for exceeding memory limits, this exception happens (a config sketch for
this case follows below).
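
A minimal PySpark sketch of the kind of settings that address the scenario in
point 4 (the values are placeholders; spark.yarn.executor.memoryOverhead is the
pre-2.3 property name, renamed spark.executor.memoryOverhead later, and the
external shuffle service also requires the spark_shuffle aux-service to be
configured on the YARN NodeManagers):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            # extra non-heap headroom so YARN is less likely to kill the container
            .set("spark.yarn.executor.memoryOverhead", "2048")  # in MB
            # serve shuffle files from the NodeManager so map output can survive executor loss
            .set("spark.shuffle.service.enabled", "true"))
    sc = SparkContext(conf=conf)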

So I have 3 questions:
1. Is my analysis correct?
2. Are there other clues or causes that could result in this exception?
3. How can this exception be fixed?

Thanks,
Warren


MetadataFetchFailedException: Missing an output location for shuffle 0

2016-03-03 Thread Pierre Villard
Hi,

I have set up a Spark job and it keeps failing even though I have tried many
different configurations of the memory parameters (as suggested in other
threads I read).

My configuration:

Cluster of 4 machines: 4 vCPUs, 16 GB RAM each.
YARN version: 2.7.1
Spark version: 1.5.2

I tried many configurations for the memory per executor and the YARN executor
memory overhead. The last attempt was: 4 executors, 8 GB per executor, and
4 GB of overhead (spark.yarn.executor.memoryOverhead).
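
For reference, a rough PySpark sketch of that last configuration, using the
Spark 1.5-era property names (the exact submission command is not shown in this
thread, so this is only an approximation):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.executor.instances", "4")                # 4 executors
            .set("spark.executor.memory", "8g")                  # 8 GB per executor
            .set("spark.yarn.executor.memoryOverhead", "4096"))  # 4 GB overhead, in MB
    sc = SparkContext(conf=conf)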

My input data is only about 10 GB, but I am joining multiple datasets, so the
10 GB is not representative of the final dataset on which I am trying to do
some machine learning.

As far as I can tell from the logs, YARN does not seem to be killing the
containers. Besides, when persisting RDDs I changed the storage level to
MEMORY_AND_DISK_SER.
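
That persist call looks roughly like this in PySpark 1.x (MEMORY_AND_DISK_SER is
defined in Spark 1.x; later PySpark versions drop the _SER variants because
Python data is always stored serialized; the RDDs here are placeholders):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext()
    # placeholder RDDs standing in for the real datasets being joined
    left = sc.parallelize([(i, i) for i in range(100)])
    right = sc.parallelize([(i, i * 2) for i in range(100)])
    joined = left.join(right)
    joined.persist(StorageLevel.MEMORY_AND_DISK_SER)  # spill serialized partitions to disk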

At the end it keeps failing with the following stack trace:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:456)
        at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:183)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:47)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

For the record, when executing the job with a much smaller input dataset,
everything is OK.

Any idea?

Thanks!
Pierre