Actually, I ran into a similar issue when doing a groupByKey followed by a
count when the shuffle size is big, e.g. 1 TB.
Thanks.
Zhan Zhang
On Sep 21, 2014, at 10:56 PM, Nishkam Ravi nr...@cloudera.com wrote:
Thanks for the quick follow up Reynold and Patrick. Tried a run with
Hey all. We also had the same problem described by Nishkam, in almost the
same big data setting. We fixed the fetch failures by increasing the timeout
for acks in the driver:
set("spark.core.connection.ack.wait.timeout", "600") // 10-minute timeout for acks between nodes (value is in seconds)
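
For anyone wanting to try the same thing, here is a minimal sketch of how that setting might be applied on a SparkConf in a Spark 1.x driver. The object name and app name are hypothetical placeholders, and the 600-second value is simply the one mentioned above, not a tuned recommendation.

```scala
import org.apache.spark.SparkConf

// Hypothetical driver-side setup; only the ack timeout key and value come from this thread.
object AckTimeoutExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ack-timeout-example") // placeholder app name (assumption)
      // Wait up to 600 seconds (10 minutes) for acks before giving up.
      .set("spark.core.connection.ack.wait.timeout", "600")

    // Read the value back to confirm it is set before creating a SparkContext.
    println(conf.get("spark.core.connection.ack.wait.timeout"))
  }
}
```

The same key could presumably also be passed at submit time or through spark-defaults.conf instead of being hard-coded in the driver.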
Cheers, Christoph
2014-09-22
I've run into this with large shuffles - I assumed that there was
contention between the shuffle output files and the JVM for memory.
Whenever we start getting these fetch failures, it corresponds with high
load on the machines the blocks are being fetched from, and in some cases
complete
Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the
workloads. Tried tracing the problem through change set analysis. Looks
like the offending commit is 4fde28c from Aug 4th for PR1707. Please see
SPARK-3633 for more details.
Thanks,
Nishkam
It seems like you just need to raise the ulimit?
On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi nr...@cloudera.com wrote:
Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the
workloads. Tried tracing the problem through change set analysis. Looks
like the offending commit
Hey, the numbers you mentioned don't quite line up - did you mean PR 2711?
On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin r...@databricks.com wrote:
It seems like you just need to raise the ulimit?
On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi nr...@cloudera.com wrote:
Recently upgraded to
Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible
that you are just having more spilling as a result of the patch and so
the filesystem is opening more files. I would try increasing the
ulimit.
How much memory do your executors have?
- Patrick
On Sun, Sep 21, 2014 at 10:29
Thanks for the quick follow up Reynold and Patrick. Tried a run with
a significantly higher ulimit; it doesn't seem to help. The executors have
35GB each. Btw, with a recent version of the branch, the error message is
"fetch failures" as opposed to "too many open files". Not sure if they are
related.