Has there been any further work (or at least consideration) on joins operating on local data via co-partitioning?
2015-05-09 11:39 GMT-07:00 Michael Armbrust <mich...@databricks.com>:

> Ah, unfortunately that is not possible today, as Catalyst has a logical
> notion of partitioning that is different from that exposed by the RDD.
>
> A researcher at Databricks is considering allowing this kind of
> optimization for in-memory cached relations, though. Here is a WIP patch:
>
> https://github.com/yhuai/spark/commit/aa4448ac801f833f7c8217fdce30cba9c803a877
>
> On Fri, May 8, 2015 at 4:06 PM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:
>
>> Just trying to make sure that something I know in advance (the joins
>> will always have an equality test on one specific field) is used to
>> optimize the partitioning so the joins only use local data.
>>
>> Thanks for the info.
>>
>> Ron
>>
>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>> *Sent:* Friday, May 08, 2015 3:15 PM
>> *To:* Daniel, Ronald (ELS-SDG)
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Hash Partitioning and Dataframes
>>
>> What are you trying to accomplish? Internally, Spark SQL will add
>> Exchange operators to make sure that data is partitioned correctly for
>> joins and aggregations. If you are going to do other RDD operations on the
>> result of DataFrame operations and you need to manually control the
>> partitioning, call df.rdd and partition as you normally would.
>>
>> On Fri, May 8, 2015 at 2:47 PM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:
>>
>> Hi,
>>
>> How can I ensure that a batch of DataFrames I make are all partitioned
>> based on the value of one column common to them all?
>> For RDDs I would partitionBy with a HashPartitioner, but I don't see that in
>> the DataFrame API.
>> If I partition the RDDs that way, then do a toDF(), will the partitioning
>> be preserved?
>>
>> Thanks,
>> Ron
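For anyone following the thread, the idea behind the HashPartitioner-based co-partitioning being discussed can be illustrated without Spark at all. This is a toy sketch, not the Spark API: the function names are hypothetical, and Python's built-in `hash` stands in for the hashCode-modulo-numPartitions scheme Spark's HashPartitioner uses. The point it demonstrates is why co-partitioned joins need no shuffle: because both sides were bucketed by the same partitioner on the same key, matching keys are guaranteed to sit in partitions with the same index, so the join can proceed partition-by-partition on purely local data.

```python
NUM_PARTITIONS = 4

def partition_by_key(rows, num_partitions=NUM_PARTITIONS):
    """Hash-partition (key, value) pairs into num_partitions buckets,
    analogous to rdd.partitionBy(new HashPartitioner(n))."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def local_join(left_parts, right_parts):
    """Equi-join two co-partitioned datasets partition-by-partition.
    No data moves between partitions: matching keys are already in
    buckets with the same index because both sides used the same
    partitioner on the join key."""
    joined = []
    for lp, rp in zip(left_parts, right_parts):
        # Build a per-partition index of the right side, then probe it.
        rindex = {}
        for key, value in rp:
            rindex.setdefault(key, []).append(value)
        for key, lvalue in lp:
            for rvalue in rindex.get(key, []):
                joined.append((key, lvalue, rvalue))
    return joined

left = partition_by_key([("a", 1), ("b", 2), ("c", 3)])
right = partition_by_key([("a", "x"), ("c", "y"), ("d", "z")])

print(sorted(local_join(left, right)))
# → [('a', 1, 'x'), ('c', 3, 'y')]
```

As Michael notes above, the catch is that Catalyst's logical notion of partitioning is separate from the RDD's, so partitioning done at the RDD level this way is not visible to the DataFrame optimizer.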