[ https://issues.apache.org/jira/browse/HIVE-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koifman resolved HIVE-11672.
-----------------------------------
    Resolution: Duplicate

> Hive Streaming API handles bucketing incorrectly
> ------------------------------------------------
>
>                 Key: HIVE-11672
>                 URL: https://issues.apache.org/jira/browse/HIVE-11672
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog, Hive, Transactions
>    Affects Versions: 1.2.1
>            Reporter: Raj Bains
>            Assignee: Roshan Naik
>            Priority: Critical
>
> The Hive Streaming API allows clients to get a random bucket and then insert data into it. However, this leads to incorrect bucketing: Hive expects data to be distributed into buckets by a hash function applied to the bucket key, whereas clients currently insert data into buckets at random. Clients have no way of:
> # Knowing which bucket a row (tuple) belongs to
> # Asking for a specific bucket
> Optimizations such as Sort Merge Join and Bucket Map Join rely on the data being correctly distributed across buckets, and they will produce incorrect read results if it is not.
> There are two obvious design choices:
> # The Hive Streaming API fixes this internally by distributing the data correctly.
> # The Hive Streaming API exposes the data distribution scheme to clients and allows them to distribute the data correctly.
> The first option means every client thread writes to many buckets, producing many small files in each bucket and keeping too many connections open; this does not seem feasible. The second option pushes more functionality into the client of the Hive Streaming API, but it can maintain high throughput and write well-sized ORC files. This option seems preferable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
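Under the second design choice, a streaming client would have to reproduce Hive's bucket assignment itself. Below is a minimal sketch, assuming Hive's default rule for string bucket keys, `(hash & Integer.MAX_VALUE) % numBuckets`, with a string hash equivalent to `java.lang.String#hashCode`; `BucketSketch` and `bucketFor` are hypothetical names for illustration, not part of any Hive API.

```java
// Sketch: client-side bucket selection under design option 2.
// Assumption: Hive's default hash-partitioning for string keys is
// (hash & Integer.MAX_VALUE) % numBuckets, where the string hash
// matches java.lang.String#hashCode (h = 31*h + c over the chars).
public class BucketSketch {
    // Hypothetical helper: which bucket should this row go to?
    static int bucketFor(String bucketKey, int numBuckets) {
        int hash = bucketKey.hashCode();                // assumed equivalent to Hive's string hash
        return (hash & Integer.MAX_VALUE) % numBuckets; // mask keeps the result non-negative
    }

    public static void main(String[] args) {
        // Deterministic: the same bucket key always maps to the same bucket,
        // so one client thread can own one bucket and write large ORC files.
        int b = bucketFor("customer-42", 8);
        System.out.println("row goes to bucket " + b);
    }
}
```

Because the mapping is deterministic, each client thread can claim the rows hashing to its own bucket, which keeps the number of open connections and small files down, matching the throughput argument for option 2 above.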