GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/15228

[SPARK-17654] [SQL] Propagate bucketing information for Hive tables to / from Catalog

## What changes were proposed in this pull request?

Currently Spark does not respect bucketing for Hive tables. This PR includes the following changes:

- `HiveClientImpl` now extracts the table's bucketing information.
- While writing table info to the metastore, `MetastoreRelation` now populates the bucketing information in the Hive `Table` object.
- `HiveTableScanExec` now exposes `outputPartitioning` and `outputOrdering` as per the bucketing spec.
- `InsertIntoHiveTable` now exposes `requiredChildDistribution` and `requiredChildOrdering` based on the target table's bucketing spec.

TODOs (which will be done in linked PRs and not this one):

- [ ] `ClusteredDistribution` does not guarantee the number of partitions generated (which corresponds to the number of output bucket files created). This will require adding strict guarantees to `ClusteredDistribution`. I think it needs more thought, and it is better done incrementally than packed into this PR.
- [ ] While writing to bucketed files, Hive's hashing function should be used. I have a PR open to implement Hive hashing natively in Spark: https://github.com/apache/spark/pull/15047
- [ ] Allow creating Hive bucketed tables

## How was this patch tested?

Tested with Hive tables created locally. Adding a new test case will need implementing bucketed table creation, which is not supported :( Suggestions welcome.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark SPARK-17654_hive_extract_bucketing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15228.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15228
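
For reviewers less familiar with bucketing, here is a rough, language-neutral sketch of the idea behind the scan-side and write-side changes in this pull request. It is not code from the patch. Every name in it (`BucketSpec` as a plain Python class, `toy_hash`, `bucket_id`, `scan_satisfies`, `write_bucketed`) is a hypothetical stand-in, the hash is a toy function rather than Hive's hashing function, and the "bucket columns are a subset of the required keys" check is a simplified assumption about when a bucketed scan can be treated as already clustered.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BucketSpec:
    """Hypothetical stand-in for a table's bucketing metadata."""
    num_buckets: int
    bucket_columns: Tuple[str, ...]
    sort_columns: Tuple[str, ...] = ()


def toy_hash(value) -> int:
    """Deterministic toy hash over the value's string form (NOT Hive's hash)."""
    h = 0
    for ch in str(value):
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h


def bucket_id(row: dict, spec: BucketSpec) -> int:
    """Map a row to a bucket from its bucket-column values."""
    h = 0
    for col in spec.bucket_columns:
        h = (31 * h + toy_hash(row[col])) & 0x7FFFFFFF
    return h % spec.num_buckets


def scan_satisfies(spec: Optional[BucketSpec], required_keys: Sequence[str]) -> bool:
    """Scan side (simplified assumption): a bucketed table is already
    clustered on required_keys if its bucket columns are a subset of them,
    so no extra shuffle would be needed for that requirement."""
    if spec is None:
        return False
    return set(spec.bucket_columns) <= set(required_keys)


def write_bucketed(rows: List[dict], spec: BucketSpec) -> Dict[int, List[dict]]:
    """Write side: group rows by bucket id, then sort each bucket by the
    sort columns. Each dictionary entry models one output bucket file."""
    buckets: Dict[int, List[dict]] = {i: [] for i in range(spec.num_buckets)}
    for row in rows:
        buckets[bucket_id(row, spec)].append(row)
    if spec.sort_columns:
        for part in buckets.values():
            part.sort(key=lambda r: tuple(r[c] for c in spec.sort_columns))
    return buckets


# Example: a table bucketed by user_id into 4 buckets, sorted by ts.
spec = BucketSpec(num_buckets=4, bucket_columns=("user_id",), sort_columns=("ts",))
rows = [{"user_id": u, "ts": t} for u, t in [(1, 30), (2, 10), (1, 5), (3, 7), (2, 2)]]
buckets = write_bucketed(rows, spec)
print({b: len(part) for b, part in buckets.items()})
print(scan_satisfies(spec, ["user_id"]))
```

The sketch always produces exactly `num_buckets` groups, which is the guarantee the first TODO says `ClusteredDistribution` does not provide. Real compatibility with Hive would also need the writer to use Hive's hash function, as the second TODO notes.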