Thanks, Michael, for your input.

By 1), do you mean one of the following (sketched in code below):

   - Caching the partitioned_rdd
   - Caching the partitioned_df
   - *Or* just caching the unpartitioned_df, without needing to create the
     partitioned_rdd variable at all?
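
For concreteness, here is a minimal sketch of how I read each option, using
the names from my snippet quoted below (please correct me if I am misreading):

    // (a) Cache the hash-partitioned RDD before wrapping it in a DataFrame.
    partitioned_rdd.cache()

    // (b) Cache the DataFrame built from the partitioned RDD.
    partitioned_df.cache()

    // (c) Skip the RDD round-trip and just cache the original DataFrame.
    unpartitioned_df.cache()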


Can you expand a little bit more on 2)?
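In particular, is the roadmap item something like repartitioning by a column
expression instead of a custom Partitioner class? A guess at what that might
look like (hypothetical API and column name; as far as I know nothing like
this exists in 1.3):

    // Hypothetical: repartition by a logical column expression rather than
    // a class-based HashPartitioner (not available in Spark 1.3).
    val repartitioned_df =
      unpartitioned_df.repartition(unpartitioned_df("user_id"))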


Thanks!

On Wed, Oct 14, 2015 at 12:11 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> This won't help, for two reasons:
>  1) It's all still just creating lineage, since you aren't caching the
> partitioned data.  It will still fetch the shuffled blocks for each query.
>  2) The query optimizer is not aware of RDD-level partitioning, since it's
> mostly a black box.
>
> 1) could be fixed by adding caching.  2) is on our roadmap (though you'd
> have to use logical DataFrame expressions to do the partitioning instead of
> a class-based partitioner).
>
> On Wed, Oct 14, 2015 at 8:45 AM, Cesar Flores <ces...@gmail.com> wrote:
>
>>
>> My current version of Spark is 1.3.0, and my question is the following:
>>
>> I have large data frames where the main field is a user id. I need to do
>> many group-bys and joins using that field. Will performance increase if,
>> before doing any group-by or join operation, I first convert to an RDD in
>> order to partition by the user id? In other words, will trying something
>> like the following lines on all my user data tables improve performance in
>> the long run?
>>
>> import org.apache.spark.HashPartitioner
>>
>> // Key each row by its first column (the user id), hash-partition into 200
>> // partitions, then drop the key to get back an RDD[Row].
>> val partitioned_rdd = unpartitioned_df
>>   .map(row => (row.getLong(0), row))
>>   .partitionBy(new HashPartitioner(200))
>>   .map(_._2)
>>
>> val partitioned_df =
>>   hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)
>>
>>
>> Thanks a lot
>> --
>> Cesar Flores
>>
>
>


-- 
Cesar Flores
