[ 
https://issues.apache.org/jira/browse/HUDI-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Feng updated HUDI-5560:
----------------------------
    Summary: Make Consistent hash index Bucket Resizing more available on real 
cases   (was: Make Consistent hash index more available on real cases )

> Make Consistent hash index Bucket Resizing more available on real cases 
> ------------------------------------------------------------------------
>
>                 Key: HUDI-5560
>                 URL: https://issues.apache.org/jira/browse/HUDI-5560
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: index
>            Reporter: Jian Feng
>            Priority: Major
>
> Bucket Resizing (Splitting & Merging)
> Because bucket resizing is semantically similar to clustering (i.e., 
> re-organizing small data files), this proposal integrates the resizing 
> process into the clustering service as a subtask. Resizing is triggered 
> directly by file size: small files are merged and large files are split.
> For merging files, we require the buckets to be adjacent to each other in 
> terms of their hash ranges, so that the output bucket covers a single 
> contiguous hash range. Although a standard Consistent Hashing algorithm does 
> not require this, fragmented hash ranges would add complexity to the 
> splitting process in our case.
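The adjacency requirement for merging can be illustrated with a minimal Python sketch. The `Bucket` class, the half-open `[start, end)` range representation, and the `small_threshold` parameter are assumptions made for this illustration; they are not Hudi's actual classes or configuration names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Bucket:
    # Hypothetical model: a bucket owns the half-open hash range [start, end).
    start: int
    end: int
    size_bytes: int


def merge_adjacent(buckets: List[Bucket], small_threshold: int) -> List[Bucket]:
    """Greedily merge neighbouring small buckets.

    Two buckets are merged only if their hash ranges touch, so every
    output bucket still covers exactly one contiguous range.
    """
    merged: List[Bucket] = []
    for b in sorted(buckets, key=lambda x: x.start):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.end == b.start                 # ranges are adjacent
                and prev.size_bytes < small_threshold   # both sides are small
                and b.size_bytes < small_threshold):
            merged[-1] = Bucket(prev.start, b.end, prev.size_bytes + b.size_bytes)
        else:
            merged.append(b)
    return merged
```

In this sketch two small buckets separated by a gap (or by a large bucket) are left untouched, which is the constraint described above. A real policy would likely also cap the size of a merged bucket.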
> For splitting files, a split point (i.e., hash ranges for the output buckets) 
> should be decided:
> * A simple policy is to split at the middle of the hash range, but this may 
>   produce uneven data files. In an extreme case, splitting may produce one 
>   file with all the data and one file with no data.
> * Another policy is to find a split point that distributes records evenly 
>   between the child buckets. This requires knowledge of the hash value 
>   distribution of the original bucket.
> In our implementation, we will start with the simple midpoint policy, as 
> buckets will eventually converge to a balanced distribution after multiple 
> rounds of resizing. A pluggable implementation will be kept for 
> extensibility, so that users can choose among the available policies.
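The two split policies and the pluggable design can be sketched as follows. All names here (`SplitPolicy`, `midpoint_policy`, `median_policy`, `split`) are hypothetical, and using the median of the observed hash values is just one possible way to approximate an even split; it is not taken from the proposal.

```python
from typing import Callable, List, Sequence, Tuple

Range = Tuple[int, int]  # assumed half-open hash range [start, end)
SplitPolicy = Callable[[Range, Sequence[int]], int]


def midpoint_policy(rng: Range, hashes: Sequence[int]) -> int:
    """Split at the middle of the range, ignoring the data distribution."""
    start, end = rng
    return start + (end - start) // 2


def median_policy(rng: Range, hashes: Sequence[int]) -> int:
    """Split near the median record hash so both children get similar counts."""
    if not hashes:
        return midpoint_policy(rng, hashes)
    start, end = rng
    point = sorted(hashes)[len(hashes) // 2]
    # Clamp so that neither child ends up with an empty hash range.
    return min(max(point, start + 1), end - 1)


def split(rng: Range, hashes: Sequence[int], policy: SplitPolicy):
    """Split one bucket into two children using the given policy."""
    start, end = rng
    if end - start < 2:
        raise ValueError("range too narrow to split")
    point = policy(rng, hashes)
    left: List[int] = [h for h in hashes if h < point]
    right: List[int] = [h for h in hashes if h >= point]
    return (start, point), left, (point, end), right
```

With record hashes `[60, 70, 80]` in the range `(0, 100)`, `midpoint_policy` puts every record in the right child and leaves the left child empty, which is the extreme case mentioned above. Passing the policy as a plain callable is one simple way to keep it pluggable.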
> All updates to the hash metadata are first recorded in the clustering plan 
> and then applied to the partitions' hashing metadata when clustering 
> finishes. The plan is generated and persisted in files during the scheduling 
> process, which is protected by a table-level lock to guarantee a consistent 
> table view.
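The two-phase flow (record changes in a plan first, apply them to the hashing metadata only when clustering finishes) can be sketched as below. This is an in-memory illustration only: `Table`, `ResizePlan`, `schedule`, and `complete` are hypothetical names, a `threading.Lock` stands in for the table-level lock, and the plan is kept in a list rather than persisted in files as the proposal describes.

```python
import threading
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # assumed half-open hash range [start, end)


@dataclass
class ResizePlan:
    partition: str
    remove: List[Range]  # ranges of the buckets being replaced
    add: List[Range]     # ranges of the buckets produced by resizing


@dataclass
class Table:
    metadata: Dict[str, List[Range]]  # partition -> committed bucket ranges
    pending_plans: List[ResizePlan] = field(default_factory=list)
    lock: threading.Lock = field(default_factory=threading.Lock)

    def schedule(self, partition: str, remove: List[Range], add: List[Range]) -> ResizePlan:
        # Scheduling runs under the (stand-in) table-level lock, so the plan
        # is built against a consistent view of the table.
        with self.lock:
            plan = ResizePlan(partition, list(remove), list(add))
            self.pending_plans.append(plan)  # metadata is NOT touched yet
            return plan

    def complete(self, plan: ResizePlan) -> None:
        # Only when clustering finishes is the plan applied to the metadata.
        kept = [r for r in self.metadata[plan.partition] if r not in plan.remove]
        self.metadata[plan.partition] = sorted(kept + plan.add)
        self.pending_plans.remove(plan)
```

Until `complete` is called, readers of `metadata` keep seeing the old bucket layout, which mirrors the idea that hashing metadata changes only once clustering has finished.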



--
This message was sent by Atlassian Jira
(v8.20.10#820010)