[ 
https://issues.apache.org/jira/browse/HIVE-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Végh updated HIVE-26674:
-------------------------------
    Description: 
h2. Problem statement: 

Without explicit bucketing defined, bucket files are very sensitive to the 
amount of data loaded/modified in the table. 

When
 * there are initial or larger time-window loads or reloads alongside smaller 
load schedules (e.g. initial and monthly vs. daily loads),
 * or the load schedule is periodic but the volume of the data changes is not,
 * or data volume and periodicity are both balanced but runtime resources make 
the loader application run with a different number of tasks,

the data loaded into non-explicitly bucketed full-ACID ORC tables can become 
unbalanced across buckets over time.

The number of buckets is calculated from the amount of data to be loaded. If 
the table is created with a huge amount of initial data (which creates several 
buckets), and afterwards only a few records at a time are added, but 
frequently (which are written only into the first 1-2 buckets), the data 
becomes unbalanced across the buckets: the first few buckets contain much more 
data than the others.
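
The skew described above can be reproduced with a small model. This is only an illustrative sketch: the function name, the `rows_per_task` parameter and the rule "one bucket file per writer task, task count proportional to load size" are simplifying assumptions, not Hive's actual planning logic.

```python
def simulate_loads(loads, rows_per_task):
    """Return per-bucket row counts after applying each load in order.

    Assumed model (not Hive's real planner): every load uses one writer
    task per `rows_per_task` rows, and task i appends to bucket i.
    """
    buckets = []
    for rows in loads:
        tasks = max(1, -(-rows // rows_per_task))  # ceiling division
        while len(buckets) < tasks:
            buckets.append(0)
        base, extra = divmod(rows, tasks)
        for i in range(tasks):
            # Spread the rows of this load evenly over the tasks it uses.
            buckets[i] += base + (1 if i < extra else 0)
    return buckets


# One large initial load followed by 30 small daily loads.
sizes = simulate_loads([1_000_000] + [10_000] * 30, rows_per_task=250_000)
print(sizes)
```

Under these assumptions the initial load creates four buckets, while every daily load lands in the first bucket only, so that bucket keeps growing.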
h2. Concept:
h4. Rebalancing compaction

A new compaction type (‘REBALANCE’) should be created to address badly 
balanced data among buckets. The result of this compaction would be the same 
as that of an INSERT OVERWRITE: a new base whose bucket indexes are 
independent of the previous base and deltas. The new number of buckets can 
optionally be supplied; otherwise the table keeps the same number of buckets, 
but with rebalanced data.
h4. Sorting

Optionally, a sorting expression can be supplied, to be able to re-sort the 
data during the rebalance.
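
As a rough illustration of the intended effect, the sketch below gathers all rows, optionally sorts them, and splits them into equally sized buckets. The function and its parameters are hypothetical; the real compactor works on ORC files through a compaction query, not on in-memory lists.

```python
def rebalance(buckets, num_buckets=None, sort_key=None, reverse=False):
    """Redistribute rows evenly across buckets (illustrative only)."""
    rows = [row for bucket in buckets for row in bucket]
    if sort_key is not None:
        # Optional re-sort, mirroring the proposed ORDER BY clause.
        rows.sort(key=sort_key, reverse=reverse)
    # Keep the current bucket count unless a new one is requested.
    n = len(buckets) if num_buckets is None else num_buckets
    if n < 1:
        raise ValueError("at least one bucket is required")
    base, extra = divmod(len(rows), n)
    result, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        result.append(rows[start:start + size])
        start += size
    return result


skewed = [[5, 1, 9, 3, 7, 2], [8], [4]]
print(rebalance(skewed, sort_key=lambda r: r))
print(rebalance(skewed, num_buckets=4, sort_key=lambda r: r))
```

Bucket sizes differ by at most one row after the call, whether or not a new bucket count is given.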

The expression can be supplied via the ALTER TABLE COMPACT command:
ALTER TABLE <table> COMPACT 'REBALANCE' ORDER BY <column> ASC|DESC

h4. Manual rebalance

The rebalance request can be created with the ALTER TABLE COMPACT command 
(i.e. as a manual compaction).
h4. Limitations
 * Rebalancing can be done only within partitions.
 * Rebalancing is not possible on explicitly bucketed (clustered) tables
 * Rebalancing is not possible via MR based compaction
 * Rebalancing is not supported on insert-only tables

h2. Implications
h4. Compaction request (DB schema) changes
 * A new compaction type (REBALANCE) must be added to the allowed compaction 
TYPES.
 * A new optional field (and nullable DB column) is required to store the 
number of requested implicit buckets.

h4. ALTER TABLE COMPACT changes

The ALTER TABLE COMPACT command must accept
 * the ‘REBALANCE’ compaction type,
 * optionally, the new number of buckets (... INTO \{N} BUCKETS),
 * optionally, the sorting expression (ORDER BY column ASC, columnB DESC).
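
To make the combination of the optional clauses concrete, here is a small sketch that assembles a statement in the form proposed above. The helper name and its parameters are hypothetical, the table name is placed before COMPACT as in Hive's existing compaction statement, and the clause spelling follows this proposal only; the grammar that is finally implemented may differ.

```python
def build_rebalance_statement(table, num_buckets=None, order_by=()):
    """Compose a REBALANCE compaction statement as proposed in this issue.

    `order_by` is a sequence of (column, direction) pairs.
    """
    parts = [f"ALTER TABLE {table} COMPACT 'REBALANCE'"]
    if num_buckets is not None:
        if num_buckets < 1:
            raise ValueError("num_buckets must be positive")
        parts.append(f"INTO {num_buckets} BUCKETS")
    if order_by:
        columns = []
        for column, direction in order_by:
            direction = direction.upper()
            if direction not in ("ASC", "DESC"):
                raise ValueError(f"invalid sort direction: {direction}")
            columns.append(f"{column} {direction}")
        parts.append("ORDER BY " + ", ".join(columns))
    return " ".join(parts)


print(build_rebalance_statement("sales"))
print(build_rebalance_statement("sales", 4, [("id", "asc"), ("ts", "DESC")]))
```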

h4. Compactor changes

Both the MR and query based compaction tasks must be enhanced with the ability 
to do a rebalancing compaction.
h4. Query based compaction changes

New compactor implementations are required:
 * Query based rebalance compactor for fully acid tables

h4. MR based compaction changes

MR is deprecated, so rebalancing compaction will be implemented for it only if 
that turns out to be really easy.
h2. Open points

  was:
h2. Problem statement: 

Without explicit bucketing defined, bucket files are very sensitive to the 
amount of data loaded/modified in the table. 

When
 * there are initial or larger time-window loads or reloads alongside smaller 
load schedules (e.g. initial and monthly vs. daily loads),
 * or the load schedule is periodic but the volume of the data changes is not,
 * or data volume and periodicity are both balanced but runtime resources make 
the loader application run with a different number of tasks,

the data loaded into non-explicitly bucketed full-ACID ORC tables can become 
unbalanced across buckets over time.

The number of buckets is calculated from the amount of data to be loaded. If 
the table is created with a huge amount of initial data (which creates several 
buckets), and afterwards only a few records at a time are added, but 
frequently (which are written only into the first 1-2 buckets), the data 
becomes unbalanced across the buckets: the first few buckets contain much more 
data than the others.
h2. Concept:
h4. Rebalancing compaction

A new compaction type (‘REBALANCE’) should be created to address badly 
balanced data among buckets. The result of this compaction would be the same 
as that of an INSERT OVERWRITE: a new base whose bucket indexes are 
independent of the previous base and deltas. The new number of buckets can 
optionally be supplied; otherwise the table keeps the same number of buckets, 
but with rebalanced data.
h4. Sorting

Optionally, a sorting expression can be supplied, to be able to re-sort the 
data during the rebalance.

The expression can be supplied in two ways:
 * Via the ALTER TABLE COMPACT command:
ALTER TABLE <table> COMPACT 'REBALANCE' ORDER BY <column> ASC|DESC
 * Via a table property:
hive.compactor.rebalance.orderby=<column> ASC|DESC

h4. Manual rebalance

The rebalance request can be created with the ALTER TABLE COMPACT command 
(i.e. as a manual compaction).
h4. Automatic rebalance

The rebalance request can also be created by the Initiator, based on a new set 
of thresholds evaluated in the following order:
 * Minimum size of the table
 ** Let’s say that the threshold is 100MB.
 ** If the table is smaller than 100MB, a rebalancing compaction won’t be 
initiated, regardless of the data balance.
 ** Default value: 100MB
 ** This threshold skips small tables, where larger relative differences in 
bucket size are likely but not worth rebalancing.

 * Relative standard deviation (RSD) of the bucket file sizes, i.e. the 
standard deviation divided by the average file size. If the RSD is higher than 
a predefined value, a rebalancing compaction is scheduled. For example:
 ** Let’s say that the threshold is 0.2.
 ** If the standard deviation of the bucket file sizes is larger than 20% of 
the average file size, a rebalancing compaction is required.
 ** Default value: 0.2
 ** This threshold detects tables that are becoming unbalanced overall.
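
The two checks can be sketched as follows, evaluated in the order given above. The function and parameter names are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the description does not specify which is meant.

```python
import statistics

MB = 1024 * 1024


def needs_rebalance(bucket_sizes, min_table_size=100 * MB, rsd_threshold=0.2):
    """Decide whether a rebalancing compaction should be requested.

    `bucket_sizes` holds the bucket file sizes in bytes.
    """
    if not bucket_sizes:
        return False
    total = sum(bucket_sizes)
    # 1. Minimum table size: small tables are never rebalanced.
    if total < min_table_size:
        return False
    mean = total / len(bucket_sizes)
    if mean == 0:
        return False
    # 2. Relative standard deviation of the bucket file sizes.
    rsd = statistics.pstdev(bucket_sizes) / mean
    return rsd > rsd_threshold


print(needs_rebalance([550 * MB, 250 * MB, 250 * MB, 250 * MB]))
print(needs_rebalance([300 * MB] * 4))
print(needs_rebalance([50 * MB, 1 * MB]))
```

In the first call the RSD is roughly 0.4, above the 0.2 default; the second table is perfectly balanced; the third is skewed but below the minimum size, so it is skipped.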

h4. Limitations
 * Rebalancing can be done only within partitions.
 * Rebalancing is not possible on explicitly bucketed (clustered) tables
 * Rebalancing is not possible via MR based compaction
 * In the first version, a rebalance compaction request on an insert-only 
table will run a MAJOR compaction as a fallback. MAJOR compaction on 
insert-only tables already rebalances the data, but the number of buckets 
cannot be set explicitly; it is calculated by Tez. Setting the number of 
buckets appears to be tricky, so it will be implemented later in a separate 
task. For now, rebalance on insert-only tables will ignore the requested 
number of buckets.

h2. Implications
h4. Compaction request (DB schema) changes
 * A new compaction type (REBALANCE) must be added to the allowed compaction 
TYPES.
 * A new optional field (and nullable DB column) is required to store the 
number of requested implicit buckets.

h4. Initiator changes

The Initiator must be able to calculate the REBALANCE thresholds, and initiate 
a rebalancing compaction if required.
h4. ALTER TABLE COMPACT changes

The ALTER TABLE COMPACT command must accept
 * the ‘REBALANCE’ compaction type,
 * optionally, the new number of buckets (... INTO \{N} BUCKETS),
 * optionally, the sorting expression (ORDER BY column ASC, columnB DESC).

h4. Compactor changes

Both the MR and query based compaction tasks must be enhanced with the ability 
to do a rebalancing compaction.
h4. Query based compaction changes

New compactor implementations are required:
 * Query based rebalance compactor for fully acid tables
 * Query based rebalance compactor for insert-only tables

 
h4. MR based compaction changes

MR is deprecated, so rebalancing compaction will be implemented for it only if 
that turns out to be really easy.
h2. Open points


> REBALANCE type compaction
> -------------------------
>
>                 Key: HIVE-26674
>                 URL: https://issues.apache.org/jira/browse/HIVE-26674
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Végh
>            Assignee: László Végh
>            Priority: Major
>              Labels: compaction



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
