subject:"Question about data frame partitioning in Spark 1.3.0"

Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores

My current version of spark is 1.3.0 and my question is the next: I have large data frames where the main field is an user id. I need to do many group by's and joins using that field. Do the performance will increase if before doing any group by or join operation I first convert to rdd to

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores

Thanks Michael for your input. By 1) do you mean: - Caching the partitioned_rdd - Caching the partitioned_df - *Or* just caching unpartitioned_df without the need of creating the partitioned_rdd variable? Can you expand a little bit more 2) Thanks! On Wed, Oct 14, 2015 at 12:11

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust

This won't help as for two reasons: 1) Its all still just creating lineage since you aren't caching the partitioned data. It will still fetch the shuffled blocks for each query. 2) The query optimizer is not aware of RDD level partitioning since its mostly a blackbox. 1) could be fixed by

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust

Caching the partitioned_df <- this one, but you have to do the partitioning using something like sql("SELECT * FROM ... CLUSTER BY a") as there is no such operation exposed on dataframes. 2) Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-5354

Question about data frame partitioning in Spark 1.3.0

Re: Question about data frame partitioning in Spark 1.3.0

Re: Question about data frame partitioning in Spark 1.3.0

Re: Question about data frame partitioning in Spark 1.3.0

4 matches

Site Navigation

Mail list logo

Footer information