[ 
https://issues.apache.org/jira/browse/HIVE-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-11672:
----------------------------------
    Fix Version/s:     (was: 1.2.2)

> Hive Streaming API handles bucketing incorrectly
> ------------------------------------------------
>
>                 Key: HIVE-11672
>                 URL: https://issues.apache.org/jira/browse/HIVE-11672
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Hive, Transactions
>    Affects Versions: 1.2.1
>            Reporter: Raj Bains
>            Assignee: Roshan Naik
>            Priority: Critical
>
> Hive Streaming API allows the clients to get a random bucket and then insert 
> data into it. However, this leads to incorrect bucketing as Hive expects data 
> to be distributed into buckets based on a hash function applied to bucket 
> key. The data is inserted randomly by the clients right now. They have no way 
> of
> # Knowing what bucket a row (tuple) belongs to
> # Asking for a specific bucket
> There are optimization such as Sort Merge Join and Bucket Map Join that rely 
> on the data being correctly distributed across buckets and these will cause 
> incorrect read results if the data is not distributed correctly.
> There are two obvious design choices
> # Hive Streaming API should fix this internally by distributing the data 
> correctly
> # Hive Streaming API should expose data distribution scheme to the clients 
> and allow them to distribute the data correctly
> The first option will mean every client thread will write to many buckets, 
> causing many small files in each bucket and too many connections open. this 
> does not seem feasible. The second option pushes more functionality into the 
> client of the Hive Streaming API, but can maintain high throughput and write 
> good sized ORC files. This option seems preferable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to