Are you using YARN? If so, increase the YARN memory overhead option; YARN is probably killing your executors.
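When the executor JVM's off-heap usage (Netty shuffle buffers are a common culprit) pushes the process past its container allocation, the NodeManager kills it, and the driver only sees the resulting fetch failures. A minimal sketch of the fix for Spark 1.x on YARN (2048 MB is only a starting point, a common rule of thumb being roughly 10-15% of executor memory, and your-app.jar is a placeholder):

    spark-submit \
      --master yarn \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      your-app.jar

If this is the cause, the YARN NodeManager logs should show the containers being killed for running beyond physical memory limits.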
On Jun 26, 2015 at 20:43, "XianXing Zhang" <xianxing.zh...@gmail.com> wrote:

> Do we have any update on this thread? Has anyone met and solved similar
> problems before?
>
> Any pointers will be greatly appreciated!
>
> Best,
> XianXing
>
> On Mon, Jun 15, 2015 at 11:48 PM, Jia Yu <jia...@asu.edu> wrote:
>
>> Hi Peng,
>>
>> I got exactly the same error! My shuffle data is also very large. Have
>> you figured out a way to solve it?
>>
>> Thanks,
>> Jia
>>
>> On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng <pc...@uow.edu.au> wrote:
>>
>>> I'm deploying a Spark data processing job on an EC2 cluster. The job is
>>> small for the cluster (16 cores with 120 GB of RAM in total); the
>>> largest RDD has only 76k+ rows, but it is heavily skewed in the middle
>>> (and thus requires repartitioning), and each row holds around 100 KB of
>>> data after serialization. The job always gets stuck at repartitioning:
>>> it constantly hits the following errors and retries:
>>>
>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>>> location for shuffle
>>>
>>> org.apache.spark.shuffle.FetchFailedException: Error in opening
>>> FileSegmentManagedBuffer
>>>
>>> org.apache.spark.shuffle.FetchFailedException:
>>> java.io.FileNotFoundException: /tmp/spark-...
>>>
>>> I've tried to identify the problem, but both memory and disk consumption
>>> on the machines throwing these errors are below 50%. I've also tried
>>> different configurations, including:
>>>
>>> - letting driver/executor memory use 60% of total memory
>>> - letting Netty prioritize the JVM shuffle buffer
>>> - increasing the shuffle streaming buffer to 128m
>>> - using KryoSerializer and maxing out all its buffers
>>> - increasing the shuffle memoryFraction to 0.4
>>>
>>> None of them works: the small job always triggers the same series of
>>> errors and maxes out the retries (up to 1000 times). How can I
>>> troubleshoot this kind of situation?
>>>
>>> Thanks a lot if you have any clue.
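For reference, here is a rough rendering of the tuning attempts Peng lists above as Spark 1.x properties. This is a sketch, not his actual config: the values are illustrative, and the Netty and "streaming buffer" property names (spark.shuffle.io.preferDirectBufs, spark.reducer.maxSizeInFlight) are my best guesses at what was meant:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // ~60% of a node's memory for the executor JVM (illustrative value)
      .set("spark.executor.memory", "7g")
      // guess: "let Netty prioritize the JVM shuffle buffer" =
      // prefer heap buffers over direct (off-heap) buffers
      .set("spark.shuffle.io.preferDirectBufs", "false")
      // guess: "shuffle streaming buffer" = max data in flight per reduce task
      .set("spark.reducer.maxSizeInFlight", "128m")
      // Kryo with its buffer ceiling raised ("max out all buffers")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "512m")
      // fraction of executor memory reserved for shuffle aggregation
      .set("spark.shuffle.memoryFraction", "0.4")

Note that none of these settings helps if the executors themselves are being killed by YARN; in that case only raising spark.yarn.executor.memoryOverhead (or shrinking each shuffle block, e.g. by repartitioning into more, smaller partitions) addresses the root cause.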