GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/18954
[SPARK-17654] [SQL] Enable creating hive bucketed tables

## What changes were proposed in this pull request?

### Semantics:

- If the Hive table is bucketed, the INSERT node expects the child distribution to be based on the hash of the bucket columns; otherwise it is empty. (For comparison, Spark native bucketing never enforces a required distribution, whether or not the table is bucketed; this saves the shuffle in comparison with Hive.)
- Sort ordering for an INSERT node over a Hive bucketed table is determined as follows:

  | Insert type | Normal table | Bucketed table |
  | ------------- | ------------- | ------------- |
  | non-partitioned insert | Nil | sort columns |
  | static partition | Nil | sort columns |
  | dynamic partitions | partition columns | (partition columns + bucketId + sort columns) |

  For comparison, sort ordering for Spark native bucketing is expressed as:

  |  | Normal table | Bucketed table |
  | ------------- | ------------- | ------------- |
  | sort ordering | partition columns | (partition columns + bucketId + sort columns) |

  Why is there a difference? With Hive, since bucketed insertions need a shuffle anyway, the sort ordering can be relaxed for both the non-partitioned and static-partition cases: every RDD partition gets rows corresponding to a single bucket, so those can be written to the corresponding output file after the sort. In the case of dynamic partitions, the rows need to be routed to the appropriate partition, which makes the constraints similar to Spark's.
- Only `Overwrite` mode is allowed for Hive bucketed tables, as any other mode would break the bucketing guarantees of the table. This is a difference w.r.t. how Spark bucketing works.
- With this PR, no files are created for empty buckets, and a query over such a table will fail. Creation of empty files will be supported in a coming iteration. This is a difference w.r.t. how Spark bucketing works, as it does NOT need files for empty buckets.

### Summary of changes done:

- `ClusteredDistribution` and `HashPartitioning` are modified to store the hashing function used (see the first sketch after this list).
- `RunnableCommand`s can now express their required distribution and ordering. This is used by `ExecutedCommandExec`, which runs these commands (see the second sketch after this list).
  - A nice side effect is that the logic for enforcing sort ordering inside `FileFormatWriter`, which felt out of place, could be removed. Ideally, this kind of addition of physical nodes should be done within the planner, which is what happens with this PR.
- `InsertIntoHiveTable` enforces both the distribution and the sort ordering.
- `InsertIntoHadoopFsRelationCommand` enforces the sort ordering ONLY (and not the distribution).
- Fixed a bug where any ALTER command on a bucketed table (e.g. updating stats) would wipe out the bucketing spec from the metastore. This made insertions into bucketed tables a non-idempotent operation.
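To illustrate the first change, here is a rough sketch of what carrying the hashing function inside `HashPartitioning` could look like, so that an insert into a Hive bucketed table can request Hive-compatible hashing rather than Spark's default Murmur3. The `HashFunction` trait, its case objects, and the extra parameter are hypothetical names for illustration, not the PR's actual API:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical tag for which hash the partitioning uses: Spark's default
// Murmur3, or Hive's hash, which Hive bucketing requires for compatibility.
sealed trait HashFunction
case object SparkMurmur3Hash extends HashFunction
case object HiveHash extends HashFunction

// Sketch of HashPartitioning carrying the hashing function, defaulting to
// Spark's existing behavior so non-Hive code paths are unaffected.
case class HashPartitioning(
    expressions: Seq[Expression],
    numPartitions: Int,
    hashFunction: HashFunction = SparkMurmur3Hash)
```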
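Likewise, a minimal sketch of how a command could surface its requirements to the planner. The name `requiredOrdering` appears in the commit messages below; `requiredDistribution`, the defaults, and the exact types are assumptions for illustration:

```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.physical.{Distribution, UnspecifiedDistribution}

trait RunnableCommand {
  // Distribution this command requires of its input; UnspecifiedDistribution
  // means the planner need not insert a shuffle. (Hypothetical name.)
  def requiredDistribution: Distribution = UnspecifiedDistribution

  // Sort ordering this command requires of its input; per the commit message,
  // this moves here from FileFormatWriter so the planner adds the sort node.
  def requiredOrdering: Seq[SortOrder] = Nil
}
```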
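Finally, a hypothetical end-to-end usage, assuming a Hive-enabled `SparkSession` named `spark`; the table and column names are made up. Per the semantics above, writes must use `INSERT OVERWRITE`:

```scala
// Create a Hive bucketed (and sorted) table; the DDL is standard HiveQL.
spark.sql("""
  CREATE TABLE hive_bucketed (key INT, value STRING)
  CLUSTERED BY (key) SORTED BY (key) INTO 8 BUCKETS
  STORED AS ORC
""")

// Writes go through InsertIntoHiveTable, which now requires a hash
// distribution on `key` (a shuffle) plus a sort, instead of relying on
// FileFormatWriter to enforce the ordering.
spark.sql("INSERT OVERWRITE TABLE hive_bucketed SELECT key, value FROM src")
```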
## How was this patch tested?

- Added new unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark bucket_write

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18954.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18954

----

commit 43fae74ff017959edbffa1cbd1405f58c5abe279
Author: Tejas Patil <tej...@fb.com>
Date:   2017-08-03T22:57:54Z

    bucketed writer implementation

commit 4b009a909768f2d8066fb58a45d1c54378fa8ff9
Author: Tejas Patil <tej...@fb.com>
Date:   2017-08-15T23:27:06Z

    Move `requiredOrdering` into RunnableCommand instead of `FileFormatWriter`