My current version of Spark is 1.3.0, and my question is the following:

I have large data frames where the main field is a user id. I need to do
many group-bys and joins using that field. Will performance improve if,
before doing any group-by or join operation, I first convert each data
frame to an RDD partitioned by the user id? In other words, will doing
something like the following lines on all my user data tables improve
performance in the long run?

import org.apache.spark.HashPartitioner

// Key each row by the user id (assumed to be the first column),
// hash-partition on that key, then drop the key again.
val partitioned_rdd = unpartitioned_df.rdd
   .map(row => (row.getLong(0), row))
   .partitionBy(new HashPartitioner(200))
   .map(_._2)

// Rebuild a DataFrame with the original schema (hc is my HiveContext).
val partitioned_df =
   hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)
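
For concreteness, here is a minimal sketch of the kind of downstream work
I mean (events_df, profiles_df, and the "user_id" column are made-up names
for illustration, not my real tables):

// Hypothetical repeated operations keyed on the user id.
val joined = events_df.join(profiles_df,
   events_df("user_id") === profiles_df("user_id"))
val counts = events_df.groupBy("user_id").count()

My hope is that the explicit HashPartitioner above would let operations
like these reuse the existing partitioning instead of shuffling every time.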

Thanks a lot
-- 
Cesar Flores
