Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
For sql shuffle operations like groupby, the number of output partitions is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does not honour this. In my small test, I could see that the number of partitions in DF returned by orderBy was equal to the total number of distinct

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
Hi, DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of `HashPartitioning`. `RangePartitioning` roughly samples input data and internally computes partition bounds to split given rows into `spark.sql.shuffle.partitions` partitions. Therefore, when sort keys are highly skewed, I

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Cesar Flores
Well, actually I am observing a single partition no matter what my input is. I am using spark 1.3.1. For what you both are saying, it appears that this sorting issue (going to a single partition after applying orderBy in a DF) is solved in later version of Spark? Well, if that is the case, I

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
The issue is not almost solved even in newer Spark. On Wed, Feb 10, 2016 at 1:36 AM, Cesar Flores wrote: > Well, actually I am observing a single partition no matter what my input > is. I am using spark 1.3.1. > > For what you both are saying, it appears that this sorting

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
Ohk. I was comparing groupBy with orderBy and now I realize that they are using different partitioning schemes. Thanks Takeshi. On Tue, Feb 9, 2016 at 9:09 PM, Takeshi Yamamuro wrote: > Hi, > > DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of >

Re: Optimal way to re-partition from a single partition

2016-02-08 Thread Takeshi Yamamuro
Hi, Plz use DataFrame#repartition. On Tue, Feb 9, 2016 at 7:30 AM, Cesar Flores wrote: > > I have a data frame which I sort using orderBy function. This operation > causes my data frame to go to a single partition. After using those > results, I would like to re-partition to

Optimal way to re-partition from a single partition

2016-02-08 Thread Cesar Flores
I have a data frame which I sort using orderBy function. This operation causes my data frame to go to a single partition. After using those results, I would like to re-partition to a larger number of partitions. Currently I am just doing: val rdd = df.rdd.coalesce(100, true) //df is a dataframe