GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/18954
[SPARK-17654] [SQL] Enable creating hive bucketed tables

## What changes were proposed in this pull request?

### Semantics:

- If the Hive table is bucketed, the INSERT node expects the child distribution to be based on the hash of the bucket columns; otherwise it is empty. (For comparison, Spark native bucketing never enforces a required distribution, whether or not the table is bucketed; this saves the shuffle in comparison with Hive.)
- Sort ordering for an INSERT node over a Hive bucketed table is determined as follows:

  | Insert type | Normal table | Bucketed table |
  | ------------- | ------------- | ------------- |
  | non-partitioned insert | Nil | sort columns |
  | static partition | Nil | sort columns |
  | dynamic partitions | partition columns | (partition columns + bucketId + sort columns) |

  For comparison, sort ordering for Spark native bucketing is expressed as:

  |  | Normal table | Bucketed table |
  | ------------- | ------------- | ------------- |
  | sort ordering | partition columns | (partition columns + bucketId + sort columns) |

  Why is there a difference? With Hive, since bucketed insertions need a shuffle anyway, the sort ordering can be relaxed for both the non-partitioned and static-partition cases: every RDD partition gets rows corresponding to a single bucket, so those can be written to the corresponding output file after the sort. In the case of dynamic partitions, the rows need to be routed to the appropriate partition, which makes the constraints similar to Spark's.
- Only `Overwrite` mode is allowed for Hive bucketed tables, as any other mode would break the bucketing guarantees of the table. This is a difference w.r.t. how Spark bucketing works.
- With this PR, no files are created for empty buckets, and a query over such a table will fail. Creation of empty files will be supported in a coming iteration. This is a difference w.r.t. how Spark bucketing works, as it does NOT need files for empty buckets.

### Summary of changes done:

- `ClusteredDistribution` and `HashPartitioning` are modified to store the hashing function used (see the first sketch after this list).
- `RunnableCommand`s can now express their required distribution and ordering. This is used by `ExecutedCommandExec`, which runs these commands (see the second sketch after this list).
  - A nice side effect is that the logic for enforcing sort ordering inside `FileFormatWriter`, which felt out of place, could be removed. Ideally, this kind of addition of physical nodes should be done within the planner, which is what happens with this PR.
- `InsertIntoHiveTable` enforces both the distribution and the sort ordering.
- `InsertIntoHadoopFsRelationCommand` enforces the sort ordering ONLY (and not the distribution).
- Fixed a bug where any ALTER command on a bucketed table (e.g. updating stats) would wipe out the bucketing spec from the metastore. This made insertions into bucketed tables a non-idempotent operation.
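To illustrate the first change, here is a rough sketch of what carrying the hashing function inside `HashPartitioning` could look like, so that an insert into a Hive bucketed table can request Hive-compatible hashing rather than Spark's default Murmur3. The `HashFunction` trait, its case objects, and the extra parameter are hypothetical names for illustration, not the PR's actual API:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical tag for which hash the partitioning uses: Spark's default
// Murmur3, or Hive's hash, which Hive bucketing requires for compatibility.
sealed trait HashFunction
case object SparkMurmur3Hash extends HashFunction
case object HiveHash extends HashFunction

// Sketch of HashPartitioning carrying the hashing function, defaulting to
// Spark's existing behavior so non-Hive code paths are unaffected.
case class HashPartitioning(
    expressions: Seq[Expression],
    numPartitions: Int,
    hashFunction: HashFunction = SparkMurmur3Hash)
```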
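Likewise, a minimal sketch of how a command could surface its requirements to the planner. The name `requiredOrdering` appears in the commit messages below; `requiredDistribution`, the defaults, and the exact types are assumptions for illustration:

```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.physical.{Distribution, UnspecifiedDistribution}

trait RunnableCommand {
  // Distribution this command requires of its input; UnspecifiedDistribution
  // means the planner need not insert a shuffle. (Hypothetical name.)
  def requiredDistribution: Distribution = UnspecifiedDistribution

  // Sort ordering this command requires of its input; per the commit message,
  // this moves here from FileFormatWriter so the planner adds the sort node.
  def requiredOrdering: Seq[SortOrder] = Nil
}
```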
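Finally, a hypothetical end-to-end usage, assuming a Hive-enabled `SparkSession` named `spark`; the table and column names are made up. Per the semantics above, writes must use `INSERT OVERWRITE`:

```scala
// Create a Hive bucketed (and sorted) table; the DDL is standard HiveQL.
spark.sql("""
  CREATE TABLE hive_bucketed (key INT, value STRING)
  CLUSTERED BY (key) SORTED BY (key) INTO 8 BUCKETS
  STORED AS ORC
""")

// Writes go through InsertIntoHiveTable, which now requires a hash
// distribution on `key` (a shuffle) plus a sort, instead of relying on
// FileFormatWriter to enforce the ordering.
spark.sql("INSERT OVERWRITE TABLE hive_bucketed SELECT key, value FROM src")
```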
## How was this patch tested?

- Added new unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark bucket_write

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18954.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18954

----

commit 43fae74ff017959edbffa1cbd1405f58c5abe279
Author: Tejas Patil <tej...@fb.com>
Date:   2017-08-03T22:57:54Z

    bucketed writer implementation

commit 4b009a909768f2d8066fb58a45d1c54378fa8ff9
Author: Tejas Patil <tej...@fb.com>
Date:   2017-08-15T23:27:06Z

    Move `requiredOrdering` into RunnableCommand instead of `FileFormatWriter`