[jira] [Commented] (SPARK-19256) Hive bucketing support

Cheng Su (Jira) Wed, 26 Aug 2020 22:57:14 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185617#comment-17185617
 ]


Cheng Su commented on SPARK-19256:
----------------------------------

Hi all, after discussion with [~cloud_fan], we decided to support the write 
path for Hive bucketed tables. Specifically:

1. For Hive 3.x.y, write Hive bucketed tables with Hive murmur3hash.

2. For Hive 2.x.y and 1.x.y, write Hive bucketed tables with Hive hivehash.

3. Users should expect zero or multiple files per bucket.

 

This is compatible with Hive and Presto (presto-sql for now), so users 
benefit from writing Hive bucketed tables correctly from Spark and reading 
them efficiently from Presto and Hive.
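
To make the three points above concrete, here is a rough sketch in Python. 
It is not Spark or Hive code. All function names are hypothetical, and the 
hivehash rules shown (an int hashes to itself, a string hashes as 31*h + byte 
over its UTF-8 bytes, bucket id = (hash & Integer.MAX_VALUE) % numBuckets) 
are my understanding of Hive's behavior and should be treated as assumptions. 
murmur3hash is not implemented here.

```python
from typing import Dict, Iterable, List, Tuple


def hash_function_for_hive(version: str) -> str:
    """Pick the hash name by Hive major version (points 1 and 2)."""
    major = int(version.split(".")[0])
    return "murmur3hash" if major >= 3 else "hivehash"


def hive_hash_int(value: int) -> int:
    # Assumption: hivehash of a 32-bit int is the value itself.
    return value


def hive_hash_string(value: str) -> int:
    # Assumption: hivehash of a string is 31*h + b over signed UTF-8 bytes,
    # with 32-bit wraparound as in Java int arithmetic.
    h = 0
    for b in value.encode("utf-8"):
        signed = b - 256 if b > 127 else b
        h = (31 * h + signed) & 0xFFFFFFFF
    # Convert the unsigned 32-bit result back to a signed 32-bit value.
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_id(hash_code: int, num_buckets: int) -> int:
    # Assumption: clear the sign bit, then take the remainder.
    return (hash_code & 0x7FFFFFFF) % num_buckets


def group_files_by_bucket(
    files: Iterable[Tuple[int, str]], num_buckets: int
) -> Dict[int, List[str]]:
    """Group (bucket id, path) pairs; a bucket may end up with zero or
    multiple files (point 3), so every bucket id gets an entry."""
    grouped: Dict[int, List[str]] = {b: [] for b in range(num_buckets)}
    for bid, path in files:
        grouped[bid].append(path)
    return grouped
```

For example, under these assumptions bucket_id(hive_hash_int(10), 4) is 2, 
and a reader built on group_files_by_bucket does not need special handling 
for empty buckets.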

See the JIRA description for details. I will work on each sub-task. Thanks.

> Hive bucketing support
> ----------------------
>
>                 Key: SPARK-19256
>                 URL: https://issues.apache.org/jira/browse/SPARK-19256
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Tejas Patil
>            Priority: Minor
>
> Update (2020 by Cheng Su):
> We use this JIRA to track progress on Hive bucketing write support in Spark. 
> The goal is for Spark to write Hive bucketed tables in a way that is 
> compatible with other compute engines (Hive and Presto).
>  
> Current status of Hive bucketed tables in Spark:
> Reading Hive bucketed tables is not supported: a bucketed table is read as 
> a non-bucketed table.
> Writing Hive ORC and Parquet bucketed tables behaves incorrectly: an 
> ORC/Parquet bucketed table is written as a non-bucketed table (code path: 
> InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
> Writing Hive non-ORC/Parquet bucketed tables is not allowed: an exception is 
> thrown by default (code path: InsertIntoHiveTable). The exception can be 
> disabled by setting the configs `hive.enforce.bucketing`=false and 
> `hive.enforce.sorting`=false, in which case the table is written as a 
> non-bucketed table.
>  
> Current status of Hive bucketed tables in Hive:
> Hive 3.0.0 and later: supports writing bucketed tables with Hive murmur3hash 
> (https://issues.apache.org/jira/browse/HIVE-18910).
> Hive 1.x.y and 2.x.y: supports writing bucketed tables with Hive hivehash.
> Hive on Tez: supports zero and multiple files per bucket 
> (https://issues.apache.org/jira/browse/HIVE-13988).
>  
> Current status of Hive bucketed tables in Presto (using presto-sql here):
> Supports writing bucketed tables with Hive murmur3hash and hivehash 
> ([https://github.com/prestosql/presto/pull/1697]).
> Supports zero and multiple files per bucket 
> ([https://github.com/prestosql/presto/pull/822]).
>  
> TL;DR: the goal is Hive bucketed table compatibility across Spark, Presto, 
> and Hive. In this JIRA, we need to add support for writing Hive bucketed 
> tables with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y 
> and 2.x.y).
>  
> Allowing Spark to read Hive bucketed tables efficiently needs a more radical 
> change, so we decided to wait until data source v2 supports bucketing and 
> implement the read path on data source v2. The read path is not covered by 
> this JIRA.
>  
> Original description (2017 by Tejas Patil):
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal: 
> [https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing]



