+1
This is one of the most common problems we encounter in our flow. Mark, I
am happy to help if you would like to share some of the workload.
Best
Yash
On Wednesday 2 March 2016, Mark Hamstra wrote:
> I don't entirely agree. You're best off picking the right size :). That's
> almost impossible, though, since at the input end of the query processing
> you often want a large number of partitions to get sufficient parallelism,
> both for performance and to avoid spilling or OOM, while at the output end
If you have to pick a number, it's better to overestimate than
underestimate, since task launching in Spark is relatively cheap compared
to spilling to disk or OOMing (now much less likely thanks to Tungsten).
Eventually we plan to make this dynamic, but for now you should tune it for
your particular workload.
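As an aside (not part of the original advice above): a minimal sketch of how the setting can be tuned per job, assuming a standard `spark-submit` deployment; the job file name `my_job.py` is only a placeholder.

```shell
# Override the default number of post-shuffle partitions (200)
# for a single job via spark-submit; my_job.py is a placeholder.
spark-submit --conf spark.sql.shuffle.partitions=64 my_job.py

# The same setting can be changed at runtime from a Spark SQL session:
#   SET spark.sql.shuffle.partitions=64;
```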
Hi,
I was wondering what the rationale is behind defaulting all repartitioning
to spark.sql.shuffle.partitions. I'm seeing a huge overhead when running a
job whose input has 2 partitions: with the default value of
spark.sql.shuffle.partitions, it is repartitioned to 200. Thanks.
-Teng Fei Liao