[ https://issues.apache.org/jira/browse/HIVE-24715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276424#comment-17276424 ]
Attila Magyar edited comment on HIVE-24715 at 2/1/21, 4:04 PM:
---------------------------------------------------------------

Currently the bucketId field is stored in 12 bits, so it can hold values from 0 to 4095. When Tez starts more tasks than that, the field overflows. See TEZ-4271 for more context.

{code:java}
 * Represents format of "bucket" property in Hive 3.0.
 * top 3 bits - version code.
 * next 1 bit - reserved for future
 * next 12 bits - the bucket ID
 * next 4 bits reserved for future
{code}

Simply increasing the range would have an undesired effect on compaction efficiency. If hundreds of thousands of tasks are started, then we would end up with hundreds of thousands of files, and since compaction works across statement IDs, it would not merge those.

Instead of increasing the range, the proposed solution is to let the bucket ID overflow into the statement ID, so that bucket ID 4096 becomes bucket_0 and looks as if it was created by statement_id+1. This way compaction will be able to merge the same buckets that belong to different statements.
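The overflow idea can be sketched in a few lines of Java. This is only an illustration, not the patch attached to the issue: the class and method names are hypothetical, and it assumes that the low 12 bits of the 32-bit "bucket" property hold the statement ID, which the format comment quoted above does not show.

```java
// Hypothetical sketch of letting the bucket ID overflow into the statement ID.
// Assumed layout of the 32-bit value (the statement ID field is an assumption):
//   bits 29-31: version, bit 28: reserved, bits 16-27: bucket ID,
//   bits 12-15: reserved, bits 0-11: statement ID.
final class BucketOverflowSketch {
    static final int BUCKET_BITS = 12;
    static final int MAX_BUCKET_ID = (1 << BUCKET_BITS) - 1;   // 4095
    static final int MAX_STATEMENT_ID = (1 << 12) - 1;         // 4095 (assumed width)
    static final int BUCKET_SHIFT = 16;
    static final int VERSION_SHIFT = 29;

    // Encodes a task ID that may exceed 4095 by wrapping it into the
    // 12-bit bucket field and carrying the excess into the statement ID.
    static int encode(int version, int taskId, int statementId) {
        if (taskId < 0 || statementId < 0) {
            throw new IllegalArgumentException("IDs must be non-negative");
        }
        int bucketId = taskId & MAX_BUCKET_ID;                  // taskId mod 4096
        int effectiveStatementId = statementId + (taskId >>> BUCKET_BITS);
        if (effectiveStatementId > MAX_STATEMENT_ID) {
            throw new IllegalArgumentException("statement ID field would overflow");
        }
        return (version << VERSION_SHIFT)
                | (bucketId << BUCKET_SHIFT)
                | effectiveStatementId;
    }

    // Extracts the 12-bit bucket ID from an encoded value.
    static int bucketId(int encoded) {
        return (encoded >>> BUCKET_SHIFT) & MAX_BUCKET_ID;
    }

    // Extracts the (assumed) 12-bit statement ID from an encoded value.
    static int statementId(int encoded) {
        return encoded & MAX_STATEMENT_ID;
    }
}
```

With this scheme, task 4095 of statement 0 still maps to bucket 4095 and statement 0, while task 4096 maps to bucket 0 and statement 1, so files for the same bucket number can be merged across statements.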
> Increase bucketId range
> -----------------------
>
>                 Key: HIVE-24715
>                 URL: https://issues.apache.org/jira/browse/HIVE-24715
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Attila Magyar
>            Assignee: Attila Magyar
>            Priority: Major
>             Fix For: 4.0.0
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)