For sql shuffle operations like groupby, the number of output partitions is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does not honour this.
In my small test, I could see that the number of partitions in DF returned by orderBy was equal to the total number of distinct keys. Are you observing the same, I mean do you have a single value for all rows in the column on which you are running orderBy? If yes, you are better off not running the orderBy clause. May be someone from spark sql team could answer that how should the partitioning of the output DF be handled when doing an orderBy? Hemant www.snappydata.io https://github.com/SnappyDataInc/snappydata On Tue, Feb 9, 2016 at 4:00 AM, Cesar Flores <ces...@gmail.com> wrote: > > I have a data frame which I sort using orderBy function. This operation > causes my data frame to go to a single partition. After using those > results, I would like to re-partition to a larger number of partitions. > Currently I am just doing: > > val rdd = df.rdd.coalesce(100, true) //df is a dataframe with a single > partition and around 14 million records > val newDF = hc.createDataFrame(rdd, df.schema) > > This process is really slow. Is there any other way of achieving this > task, or to optimize it (perhaps tweaking a spark configuration parameter)? > > > Thanks a lot > -- > Cesar Flores >