Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
This is likely due to a bug in shuffle file consolidation (which you have enabled) which was hopefully fixed in 1.1 with this patch: http

Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Aaron Davidson
This is likely due to a bug in shuffle file consolidation (which you have enabled), which was hopefully fixed in 1.1 with this patch: https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd Until 1.0.3 or 1.1 is released, the simplest solution is to disable spark.shuffle.co
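For anyone landing on this thread later, here is a minimal sketch of what the suggested workaround might look like in application code. The property name is cut off in the message above; this sketch assumes it refers to `spark.shuffle.consolidateFiles`, the Spark 1.x setting for shuffle file consolidation. The object name, method name, and app name are hypothetical and used only for illustration.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper object; not taken from the original poster's job.
object ShuffleWorkaround {
  // Builds a SparkConf with shuffle file consolidation turned off.
  // Assumption: the truncated property in the reply is
  // spark.shuffle.consolidateFiles.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("long-running-streaming-job") // hypothetical app name
      .set("spark.shuffle.consolidateFiles", "false")
}
```

The resulting conf would then be passed to the `StreamingContext` (or `SparkContext`) constructor as usual. Setting the same key in `conf/spark-defaults.conf` should have the same effect without a code change.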

Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
This is a long-running Spark Streaming job on YARN (Spark v1.0.2 on CDH5). The job will run for about 34-37 hours and then die with this FileNotFoundException. CPU and RAM usage are very low; I’m running 2 cores, 2 executors, and 4g of memory in YARN cluster mode. Here’s the stack trace