Hi Peng, I got exactly the same error! My shuffle data is also very large. Have you figured out a way to solve it?
Thanks, Jia

On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng <pc...@uow.edu.au> wrote:
> I'm deploying a Spark data processing job on an EC2 cluster. The job is
> small for the cluster (16 cores with 120G RAM in total); the largest RDD
> has only 76k+ rows, but it is heavily skewed in the middle (thus requiring
> repartitioning), and each row carries around 100k of data after
> serialization. The job always gets stuck at the repartitioning stage:
> it repeatedly hits the following errors and retries:
>
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
> location for shuffle
>
> org.apache.spark.shuffle.FetchFailedException: Error in opening
> FileSegmentManagedBuffer
>
> org.apache.spark.shuffle.FetchFailedException:
> java.io.FileNotFoundException: /tmp/spark-...
>
> I've tried to identify the problem, but both memory and disk consumption
> on the machines throwing these errors are below 50%. I've also tried
> different configurations, including:
>
> - let driver/executor memory use 60% of total memory
> - let netty prioritize the JVM shuffle buffer
> - increase the shuffle streaming buffer to 128m
> - use KryoSerializer and max out all its buffers
> - increase the shuffle memoryFraction to 0.4
>
> But none of them work. The small job always triggers the same series of
> errors and maxes out its retries (up to 1000 times). How can I
> troubleshoot this kind of situation?
>
> Thanks a lot if you have any clue.
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/What-are-the-likely-causes-of-org-apache-spark-shuffle-MetadataFetchFailedException-Missing-an-outpu-tp22646.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
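For readers following along: the tweaks listed in Peng's message would correspond roughly to a spark-submit invocation like the sketch below. This is only an illustration, not Peng's actual command; the property keys follow Spark 1.x-era conventions, the memory sizes are placeholders for "60% of total memory" on a hypothetical node, and the mapping of "shuffle streaming buffer" to `spark.reducer.maxMbInFlight` is an assumption. Check every key against the configuration docs for your Spark version before using it.

```shell
# Sketch only: all property keys and values below are assumptions based on
# Spark 1.x-era configuration names; verify against your version's docs.
spark-submit \
  --driver-memory 20g \
  --executor-memory 20g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max.mb=512 \
  --conf spark.shuffle.memoryFraction=0.4 \
  --conf spark.shuffle.io.preferDirectBufs=false \
  --conf spark.reducer.maxMbInFlight=128 \
  my-job.jar
```

Note that several of these knobs were renamed or removed in later Spark releases (for example, `spark.shuffle.memoryFraction` was superseded by unified memory management), so the right incantation depends heavily on which version you are running.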