[ 
https://issues.apache.org/jira/browse/HIVE-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Végh resolved HIVE-26674.
--------------------------------
       Fix Version/s: 4.0.0
    Target Version/s: 4.0.0
          Resolution: Fixed

> REBALANCE type compaction
> -------------------------
>
>                 Key: HIVE-26674
>                 URL: https://issues.apache.org/jira/browse/HIVE-26674
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Végh
>            Assignee: László Végh
>            Priority: Major
>              Labels: compaction
>             Fix For: 4.0.0
>
>
> h2. Problem statement: 
> Without explicit bucketing defined, bucket files are very sensitive to the 
> amount of data loaded/modified in the table. 
> Data loaded into non-explicitly bucketed full-ACID ORC tables can lead to 
> unbalanced bucketed tables over time when
>  * there are initial or larger time-window loads or reloads besides smaller 
> load schedules (like initial and monthly vs. daily loads),
>  * or load scheduling is periodic but the volume of data changes is not,
>  * or data volume and periodicity are both balanced but runtime resources 
> cause the loader application to run with a different number of tasks.
> The number of buckets is calculated from the amount of data to be loaded. If 
> the table is created with a huge amount of initial data (which will create 
> several buckets), and then only a few records are added to it at a time, but 
> frequently (which will be written only into the first 1-2 buckets), the data 
> will become unbalanced across the buckets. The first few buckets will 
> contain much more data than the others.
> h2. Concept:
> h4. Rebalancing compaction
> A new compaction type ('REBALANCE') should be created to address badly 
> balanced data among buckets. This compaction type would produce a table like 
> the one an INSERT OVERWRITE would produce: a new base whose bucket indexes 
> are independent of those in the previous base or deltas. The new number of 
> buckets can be optionally supplied, otherwise the new table would still have 
> the same number of buckets, but with re-balanced data.
> h4. Sorting
> Optionally, a sorting expression can be supplied, to be able to re-sort the 
> data during the rebalance.
> The expression can be supplied in two ways:
>  * Via the ALTER TABLE COMPACT command:
> ALTER TABLE <table> COMPACT 'REBALANCE' ORDER BY <column> ASC|DESC
> h4. Manual rebalance
> The rebalance request can be created by using the ALTER TABLE COMPACT command 
> (i.e. as a manual compaction).
> h4. Limitations
>  * Rebalancing can be done only within partitions.
>  * Rebalancing is not possible on explicitly bucketed (clustered) tables.
>  * Rebalancing is not possible via MR-based compaction.
>  * Rebalancing is not supported on insert-only tables.
> h2. Implications
> h4. Compaction request (DB schema) changes
>  * A new compaction type (REBALANCE) must be added to the allowed compaction 
> TYPES.
>  * A new optional field (and nullable DB column) is required to store the 
> number of requested implicit buckets.
> h4. ALTER TABLE COMPACT changes
> The ALTER TABLE COMPACT command must accept
>  * the 'REBALANCE' compaction type,
>  * optionally, the required new number of buckets (... INTO {N} BUCKETS),
>  * optionally, the sorting expression (ORDER BY column ASC, columnB DESC).
> h4. Compactor changes
> Both the MR and query based compaction tasks must be enhanced with the 
> ability to do a rebalancing compaction.
> h4. Query based compaction changes
> New compactor implementations are required:
>  * Query based rebalance compactor for fully acid tables
> h4. MR based compaction changes
> MR is deprecated; rebalancing compaction will be implemented for it only if 
> that is really easy to do.
> h2. Open points
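
The problem statement and the rebalancing concept in the quoted description
can be illustrated with a small simulation. This is only a sketch and not Hive
code: ROWS_PER_WRITER, load() and rebalance() are hypothetical names, and the
rule that the number of bucket files depends only on the size of the current
load is a simplifying assumption taken from the description. In this model
every small load lands in the first bucket.

```python
from typing import List, Optional

ROWS_PER_WRITER = 1000  # hypothetical: rows handled by one writer task


def load(buckets: List[List[int]], rows: List[int]) -> None:
    """Append rows in place; the number of bucket files written depends
    only on the size of this load (assumed model, not Hive's logic)."""
    writers = max(1, -(-len(rows) // ROWS_PER_WRITER))  # ceiling division
    while len(buckets) < writers:
        buckets.append([])
    for i, row in enumerate(rows):
        buckets[i % writers].append(row)


def rebalance(buckets: List[List[int]],
              num_buckets: Optional[int] = None,
              sort: bool = False,
              descending: bool = False) -> List[List[int]]:
    """Return new buckets with the rows spread evenly, like a rewrite of
    the whole table; optionally re-sort and change the bucket count."""
    rows = [row for bucket in buckets for row in bucket]
    if sort:
        rows.sort(reverse=descending)
    n = num_buckets if num_buckets is not None else len(buckets)
    if n <= 0:
        raise ValueError("number of buckets must be positive")
    # Contiguous slices keep any requested sort order across buckets.
    base, extra = divmod(len(rows), n)
    result, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        result.append(rows[start:start + size])
        start += size
    return result


buckets: List[List[int]] = []
load(buckets, list(range(4000)))  # large initial load creates 4 buckets
for day in range(10):             # ten small daily loads of 50 rows each
    first = 4000 + day * 50
    load(buckets, list(range(first, first + 50)))

before = [len(b) for b in buckets]            # [1500, 1000, 1000, 1000]
after = [len(b) for b in rebalance(buckets)]  # [1125, 1125, 1125, 1125]
print(before, after)
```

Passing num_buckets=3 together with sort=True corresponds to supplying both
the optional bucket count and the optional sorting expression; leaving both
out keeps the bucket count and only evens out the data.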



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
