[ https://issues.apache.org/jira/browse/HIVE-5170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774655#comment-13774655 ]
Gopal V commented on HIVE-5170:
-------------------------------

Tried to do this; unfortunately, the FileSinkOperator uses the task-id as the bucket filename. So if you have 12 reducers, the last reducer will automatically write to 00011_0. This makes it slightly more complex to fix without writing a new SortedFileSinkOperator.

> Sorted Bucketed Partitioned Insert hard-codes the reducer count == bucket count
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-5170
>                 URL: https://issues.apache.org/jira/browse/HIVE-5170
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.12.0
>         Environment: Ubuntu LXC
>            Reporter: Gopal V
>
> When performing a Hive sorted-partitioned insert, the insert optimizer hard-codes the number of output files to the actual bucket count of the table.
> https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L4852
> We need at least that many reducers (or, if limited, a switch to multi-spray, as already implemented), but more reducers are wasteful as long as the HiveKey only contains the partition columns.
> At this point we are still limited to reducers = n buckets, which is a problem for partitioning requests that need to insert nearly a terabyte of data into a single-digit bucket count and a four-digit partition count.
> Since routing is driven by the hashCode of the HiveKey, we can ensure this works by modifying the HiveKey to handle n buckets internally.
> Basically it should generate only hashCode = (sort_cols.hashCode() % n), routing to just n reducers overall, no matter how many we spin up.
> So far so good with the hard-coded reducer count.
> But provided we fix the issues brought up by HIVE-5169, the insert becomes friendlier to a higher reducer count as well.
> At this juncture, we can modify the hashCode to be slightly more interesting:
> hashCode = (part_cols.hashCode()*31 + (sort_cols.hashCode() % n))
> This generates somewhere between n and partition_count * n unique hash codes.
> Since the sort order and bucketing only have to be maintained per partition directory, distributing these codes evenly across any number of reducers lets the reducer count scale out.
> This permits a reducer count that allows far faster inserts of ORC data into a partitioned/sorted table.
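To make the proposed routing concrete, here is a minimal Java sketch of the two hashCode schemes described in the issue. This is not the real org.apache.hadoop.hive.ql.io.HiveKey; the class and field names are invented for illustration, and it only models the hash routing, not serialization or comparison.

{code:java}
import java.util.Arrays;

// Hypothetical sketch of the hash routing proposed above; all names
// here are invented for illustration.
public final class BucketRoutingKeySketch {

    private final int partHash;   // hashCode over the partition columns
    private final int sortHash;   // hashCode over the sort/bucketing columns
    private final int numBuckets; // n, the table's declared bucket count

    public BucketRoutingKeySketch(Object[] partCols, Object[] sortCols,
                                  int numBuckets) {
        this.partHash = Arrays.hashCode(partCols);
        this.sortHash = Arrays.hashCode(sortCols);
        this.numBuckets = numBuckets;
    }

    // Stage 1 (reducers pinned to n): route purely on the sort columns,
    // so only n distinct codes exist no matter how many reducers run.
    public int bucketOnlyHashCode() {
        // normalize Java's signed % so the bucket id is always in [0, n)
        return (sortHash % numBuckets + numBuckets) % numBuckets;
    }

    // Stage 2 (after HIVE-5169): mix the partition columns in, giving
    // between n and partition_count * n distinct codes:
    //   hashCode = part_cols.hashCode() * 31 + (sort_cols.hashCode() % n)
    @Override
    public int hashCode() {
        return partHash * 31 + bucketOnlyHashCode();
    }
}
{code}

With the stage-2 code spread over r reducers by the usual (hashCode & Integer.MAX_VALUE) % r partitioning, each partition directory still receives exactly n bucket streams; e.g. n = 8 buckets and ~1000 partitions give up to ~8000 distinct codes to distribute, so r can grow well past the bucket count.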