[ https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195004#comment-16195004 ]
Jeetendra Nihalani commented on SPARK-12394: -------------------------------------------- Is this issue resolved? If yes, can someone update this with a pull request which resolved the issue. > Support writing out pre-hash-partitioned data and exploit that in join > optimizations to avoid shuffle (i.e. bucketing in Hive) > ------------------------------------------------------------------------------------------------------------------------------ > > Key: SPARK-12394 > URL: https://issues.apache.org/jira/browse/SPARK-12394 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: Reynold Xin > Assignee: Nong Li > Fix For: 2.0.0 > > Attachments: BucketedTables.pdf > > > In many cases users know ahead of time the columns that they will be joining > or aggregating on. Ideally they should be able to leverage this information > and pre-shuffle the data so that subsequent queries do not require a shuffle. > Hive supports this functionality by allowing the user to define buckets, > which are hash partitioning of the data based on some key. > - Allow the user to specify a set of columns when caching or writing out data > - Allow the user to specify some parallelism > - Shuffle the data when writing / caching such that its distributed by these > columns > - When planning/executing a query, use this distribution to avoid another > shuffle when reading, assuming the join or aggregation is compatible with the > columns specified > - Should work with existing save modes: append, overwrite, etc > - Should work at least with all Hadoops FS data sources > - Should work with any data source when caching -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org