I am currently using Spark 1.3.0, and my question is the following: I have large DataFrames whose main field is a user id, and I need to do many group-bys and joins on that field. Will performance improve if, before doing any group-by or join, I first convert each DataFrame to an RDD and partition it by user id? In other words, will applying something like the following lines to all of my user data tables improve performance in the long run?
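
import org.apache.spark.HashPartitioner

// Key each row by the user id (column 0, read as a Long),
// hash-partition by that key into 200 partitions, then drop the key.
val partitioned_rdd = unpartitioned_df
  .map(row => (row.getLong(0), row))
  .partitionBy(new HashPartitioner(200))
  .map(x => x._2)

// Rebuild a DataFrame from the repartitioned rows, reusing the original schema.
val partitioned_df = hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)

Thanks a lot
--
Cesar Flores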