Question about data frame partitioning in Spark 1.3.0

Cesar Flores Wed, 14 Oct 2015 08:46:46 -0700

My current version of spark is 1.3.0 and my question is the next:

I have large data frames where the main field is an user id. I need to do
many group by's and joins using that field. Do the performance will
increase if before doing any group by or join operation I first convert to
rdd to partition by the user id? In other words trying something like the
next lines in all my user data tables will improve the performance in the
long run?:


val partitioned_rdd = unpartitioned_df
   .map(row=>(row.getLong(0), row))
   .partitionBy(new HashPartitioner(200))
   .map(x => x._2)

val partitioned_df = hc.createDataFrame(partitioned_rdd,
unpartitioned_df.schema)




Thanks a lot
-- 
Cesar Flores

Question about data frame partitioning in Spark 1.3.0

Reply via email to