Hey Andrew, Matei,
Thanks for responding.
For some more context: we were running into "Too many open files" errors
that appeared immediately after the Collect phase (about 30 seconds into a
run) on a decently sized dataset (14 MM rows).
The ulimit set in the spark-env was 256,0
> On Nov 3, 2014, at 6:28 PM, Matei Zaharia wrote:
>
> In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have
> better performance while creating fewer files. So I'd suggest trying that
> too. (BTW this had a bug with negative hash codes in 1.1.0 so you should
> try branch-1.1 for it).
>
> Matei
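As a back-of-the-envelope for the file-descriptor numbers above: with the
hash-based shuffle and consolidation off, each running map task keeps one
open file per reduce partition, so the descriptors open at once on a node
scale with (concurrent map tasks) x (reduce partitions). A minimal sketch,
with hypothetical task and partition counts:

    // Rough sketch; the counts below are hypothetical, not from this thread.
    // With hash-based shuffle and spark.shuffle.consolidateFiles=false, each
    // running map task keeps one open file per reduce partition.
    object ShuffleFileEstimate {
      def main(args: Array[String]): Unit = {
        val concurrentMapTasks = 8    // assumed map tasks running at once on a node
        val reducePartitions = 2000   // assumed number of reduce partitions
        val openDescriptors = concurrentMapTasks * reducePartitions
        println(s"~$openDescriptors shuffle files open at once on this node")
      }
    }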
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have
better performance while creating fewer files. So I'd suggest trying that
too. (BTW this had a bug with negative hash codes in 1.1.0 so you should try
branch-1.1 for it).

Matei
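To make that concrete, here is a minimal sketch of selecting the sort-based
shuffle when building the SparkConf; "spark.shuffle.manager" is the real
Spark 1.1 setting, while the object name, app name, and master below are
placeholders for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object SortShuffleExample {
      def main(args: Array[String]): Unit = {
        // Select the sort-based shuffle instead of the default hash-based one.
        val conf = new SparkConf()
          .setAppName("sort-shuffle-example")
          .setMaster("local[*]")
          .set("spark.shuffle.manager", "sort")
        val sc = new SparkContext(conf)
        // ... run the shuffle-heavy job here ...
        sc.stop()
      }
    }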
> On Nov 3, 2014, at 6:12 PM, Andrew Or wrote:
>
> Hey Matt,
>
> There's some prior work that compares consolidation performance on some
> medium-scale workload:
> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
>
> There we noticed about 2x performance degradation in the reduce phase on
> ext3. I am not aware of a more recent comparison.
Hey Matt,

There's some prior work that compares consolidation performance on some
medium-scale workload:
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

There we noticed about 2x performance degradation in the reduce phase on
ext3. I am not aware of a more recent comparison.
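For reference, consolidation in the hash-based shuffle is controlled by a
single boolean flag; a minimal sketch of turning it on (the key is the real
Spark 1.x setting and defaults to false):

    import org.apache.spark.SparkConf

    // Enable shuffle file consolidation for the hash-based shuffle,
    // so map tasks on the same core reuse a pool of shuffle files.
    val conf = new SparkConf()
      .set("spark.shuffle.consolidateFiles", "true")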
Hi everyone,

I'm running into more and more cases where too many files are opened when
spark.shuffle.consolidateFiles is turned off.

I was wondering if this is a common scenario among the rest of the
community, and if so, whether it would be worth turning the setting on by
default. From the