Are you using YARN? If so, increase the YARN memory overhead option; YARN is probably killing your executors.
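When the executor JVM's off-heap usage (Netty shuffle buffers are a common culprit) pushes the process past its container allocation, the NodeManager kills it, and the driver only sees the resulting fetch failures. A minimal sketch of the fix for Spark 1.x on YARN (2048 MB is only a starting point, a common rule of thumb being roughly 10-15% of executor memory, and your-app.jar is a placeholder):

    spark-submit \
      --master yarn \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      your-app.jar

If this is the cause, the YARN NodeManager logs should show the containers being killed for running beyond physical memory limits.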
On Jun 26, 2015 at 20:43, "XianXing Zhang" <xianxing.zh...@gmail.com> wrote:

> Do we have any update on this thread? Has anyone met and solved similar
> problems before?
>
> Any pointers will be greatly appreciated!
>
> Best,
> XianXing
>
> On Mon, Jun 15, 2015 at 11:48 PM, Jia Yu <jia...@asu.edu> wrote:
>
>> Hi Peng,
>>
>> I got exactly the same error! My shuffle data is also very large. Have
>> you figured out a way to solve it?
>>
>> Thanks,
>> Jia
>>
>> On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng <pc...@uow.edu.au> wrote:
>>
>>> I'm deploying a Spark data processing job on an EC2 cluster. The job is
>>> small for the cluster (16 cores with 120 GB of RAM in total); the
>>> largest RDD has only 76k+ rows, but it is heavily skewed in the middle
>>> (and thus requires repartitioning), and each row holds around 100 KB of
>>> data after serialization. The job always gets stuck at repartitioning:
>>> it constantly hits the following errors and retries:
>>>
>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>>> location for shuffle
>>>
>>> org.apache.spark.shuffle.FetchFailedException: Error in opening
>>> FileSegmentManagedBuffer
>>>
>>> org.apache.spark.shuffle.FetchFailedException:
>>> java.io.FileNotFoundException: /tmp/spark-...
>>>
>>> I've tried to identify the problem, but both memory and disk consumption
>>> on the machines throwing these errors are below 50%. I've also tried
>>> different configurations, including:
>>>
>>> - letting driver/executor memory use 60% of total memory
>>> - letting Netty prioritize the JVM shuffle buffer
>>> - increasing the shuffle streaming buffer to 128m
>>> - using KryoSerializer and maxing out all its buffers
>>> - increasing the shuffle memoryFraction to 0.4
>>>
>>> None of them works: the small job always triggers the same series of
>>> errors and maxes out the retries (up to 1000 times). How can I
>>> troubleshoot this kind of situation?
>>>
>>> Thanks a lot if you have any clue.
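For reference, here is a rough rendering of the tuning attempts Peng lists above as Spark 1.x properties. This is a sketch, not his actual config: the values are illustrative, and the Netty and "streaming buffer" property names (spark.shuffle.io.preferDirectBufs, spark.reducer.maxSizeInFlight) are my best guesses at what was meant:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // ~60% of a node's memory for the executor JVM (illustrative value)
      .set("spark.executor.memory", "7g")
      // guess: "let Netty prioritize the JVM shuffle buffer" =
      // prefer heap buffers over direct (off-heap) buffers
      .set("spark.shuffle.io.preferDirectBufs", "false")
      // guess: "shuffle streaming buffer" = max data in flight per reduce task
      .set("spark.reducer.maxSizeInFlight", "128m")
      // Kryo with its buffer ceiling raised ("max out all buffers")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "512m")
      // fraction of executor memory reserved for shuffle aggregation
      .set("spark.shuffle.memoryFraction", "0.4")

Note that none of these settings helps if the executors themselves are being killed by YARN; in that case only raising spark.yarn.executor.memoryOverhead (or shrinking each shuffle block, e.g. by repartitioning into more, smaller partitions) addresses the root cause.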