Re: Dataframe Partitioning

2016-03-01 Thread yash datta
+1 This is one of the most common problems we encounter in our flow. Mark, I am happy to help if you would like to share some of the workload.

Best,
Yash

On Wednesday 2 March 2016, Mark Hamstra wrote:
> I don't entirely agree. You're best off picking the right size :). That's almost impossible, though, ...

Re: Dataframe Partitioning

2016-03-01 Thread Mark Hamstra
I don't entirely agree. You're best off picking the right size :). That's almost impossible, though, since at the input end of the query processing you often want a large number of partitions to get sufficient parallelism for both performance and to avoid spilling or OOM, while at the output end ...
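For illustration, a minimal sketch of the trade-off Mark describes, written against the Spark 1.6-era spark-shell (sc and sqlContext provided); the paths, column name, and partition counts are assumptions for the example, not values from this thread:

    // Input end: many partitions for parallelism and to keep each task's
    // working set small enough to avoid spilling or OOM.
    val df = sqlContext.read.parquet("/data/events")   // hypothetical path
      .repartition(400)                                // large count for the heavy stages

    val counts = df.groupBy("key").count()             // hypothetical column

    // Output end: far fewer partitions, so the job does not emit
    // hundreds of tiny files.
    counts.coalesce(8).write.parquet("/data/counts")   // hypothetical path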

Re: Dataframe Partitioning

2016-03-01 Thread Michael Armbrust
If you have to pick a number, it's better to overestimate than underestimate, since task launching in Spark is relatively cheap compared to spilling to disk or OOMing (now much less likely due to Tungsten). Eventually, we plan to make this dynamic, but you should tune for your particular workload.
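For concreteness, a hedged example of the per-workload tuning Michael suggests, again using the 1.6-era API; the value 800 is an illustrative overestimate, not a recommendation from the thread:

    // Overestimate the shuffle partition count for a heavy job: cheap extra
    // tasks beat spilling to disk or OOMing.
    sqlContext.setConf("spark.sql.shuffle.partitions", "800")

    // Or set it at submit time:
    //   spark-submit --conf spark.sql.shuffle.partitions=800 ...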

Dataframe Partitioning

2016-03-01 Thread Teng Liao
Hi, I was wondering what the rationale is behind defaulting all repartitioning to spark.sql.shuffle.partitions. I’m seeing a huge overhead when running a job whose input has 2 partitions: with the default value for spark.sql.shuffle.partitions, the shuffled data now has 200. Thanks. -Teng Fei Liao
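For reference, a small sketch that reproduces the behavior Teng describes, assuming a 1.6-era spark-shell; the data and column names are made up for the example:

    import sqlContext.implicits._

    // An input DataFrame with exactly 2 partitions.
    val df = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2).toDF("key", "value")
    println(df.rdd.partitions.length)        // 2

    // Any shuffle (groupBy, join, ...) produces
    // spark.sql.shuffle.partitions output partitions: 200 by default.
    val grouped = df.groupBy("key").count()
    println(grouped.rdd.partitions.length)   // 200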