[ https://issues.apache.org/jira/browse/HIVE-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Végh resolved HIVE-26674.
--------------------------------
    Fix Version/s: 4.0.0
 Target Version/s: 4.0.0
       Resolution: Fixed

> REBALANCE type compaction
> -------------------------
>
>                 Key: HIVE-26674
>                 URL: https://issues.apache.org/jira/browse/HIVE-26674
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Végh
>            Assignee: László Végh
>            Priority: Major
>              Labels: compaction
>             Fix For: 4.0.0
>
>
> h2. Problem statement:
> Without explicit bucketing defined, bucket files are very sensitive to the
> amount of data loaded/modified in the table.
> When
> * there are initial or larger time-window loads or reloads beside smaller
> load schedules (like initial and monthly vs. daily loads)
> * or even if load scheduling is periodic but the volume of the data changes
> is not,
> * or even if data volume and periodicity are all balanced but runtime
> resources cause the loader application to run with a different number of
> tasks
> the data loaded into non-explicitly bucketed full-acid ORC tables can lead to
> unbalanced bucketed tables over time!
> The number of buckets is calculated from the amount of data to be loaded. If
> the table is created with a huge amount of initial data (which will create
> several buckets), and then only a few records are added to it frequently
> (which will be written only into the first 1-2 buckets), the data will end up
> unbalanced across the buckets: the first few buckets will contain much more
> data than the others.
> h2. Concept:
> h4. Rebalancing compaction
> A new compaction type ('REBALANCE') should be created to address the issue
> of badly balanced data among buckets. This compaction type would produce a
> table similar to the result of an INSERT OVERWRITE: a new base, with bucket
> indexes independent of the previous base or deltas. The new number of buckets
> can be optionally supplied, otherwise the new table would still have the same
> number of buckets, but with re-balanced data.
> h4. Sorting
> Optionally, a sorting expression can be supplied, to be able to re-sort the
> data during the rebalance.
> The expression can be supplied in two ways:
> * Via the ALTER TABLE COMPACT:
> ALTER TABLE <table> COMPACT 'REBALANCE' ORDER BY <column> ASC|DESC
> h4. Manual rebalance
> The rebalance request can be created by using the ALTER TABLE COMPACT command
> (i.e. manual compaction).
> h4. Limitations
> * Rebalancing can be done only within partitions.
> * Rebalancing is not possible on explicitly bucketed (clustered) tables.
> * Rebalancing is not possible via MR based compaction.
> * Rebalancing is not supported on insert-only tables.
> h2. Implications
> h4. Compaction request (DB schema) changes
> * A new compaction type (REBALANCE) must be added to the allowed compaction
> TYPES.
> * A new optional field (and nullable DB column) is required to store the
> number of requested implicit buckets.
> h4. ALTER TABLE COMPACT changes
> The ALTER TABLE COMPACT command must accept:
> * the 'REBALANCE' compaction type
> * optionally, the new number of required buckets (... INTO {N} BUCKETS)
> * optionally, the sorting expression (ORDER BY column ASC, columnB DESC)
> h4. Compactor changes
> Both the MR and query based compaction tasks must be enhanced with the
> ability to do a rebalancing compaction.
> h4. Query based compaction changes
> New compactor implementations are required:
> * Query based rebalance compactor for fully acid tables
> h4. MR based compaction changes
> MR is deprecated, so rebalancing compaction will only be implemented there if
> it is really easy to do so.
> h2. Open points



--
This message was sent by Atlassian Jira
(v8.20.10#820010)