[ https://issues.apache.org/jira/browse/HUDI-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jian Feng updated HUDI-5560:
----------------------------
    Summary: Make Consistent hash index Bucket Resizing more available on real cases  (was: Make Consistent hash index more available on real cases )

> Make Consistent hash index Bucket Resizing more available on real cases
> ------------------------------------------------------------------------
>
>                 Key: HUDI-5560
>                 URL: https://issues.apache.org/jira/browse/HUDI-5560
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: index
>            Reporter: Jian Feng
>            Priority: Major
>
> Bucket Resizing (Splitting & Merging)
> Because bucket resizing is semantically similar to clustering (i.e.,
> re-organizing small data files), this proposal plans to integrate the
> resizing process into the clustering service as a subtask. Resizing is
> triggered directly by file size: small files will be merged and large files
> will be split.
> For merging files, we require the buckets to be adjacent to each other in
> terms of their hash ranges, so that the output bucket has exactly one
> continuous hash range. Although a standard Consistent Hashing algorithm does
> not require this, fragmented hash ranges would add complexity to the
> splitting process in our case.
> For splitting files, a split point (i.e., the hash ranges of the output
> buckets) must be decided:
> A simple policy is to split at the middle of the hash range, but this may
> produce uneven data files. In an extreme case, splitting may produce one file
> with all the data and one file with no data.
> Another policy is to find a split point that dispatches records evenly into
> the child buckets. It requires knowledge of the hash value distribution of
> the original bucket.
> In our implementation, we will start with the first, simple policy, since
> buckets will eventually converge to a balanced distribution after multiple
> rounds of resizing.
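
To make the two splitting policies and the adjacency requirement for merging concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Hudi's implementation: the `Bucket` class, the half-open range convention `[start, end)`, and the function names `split_at_middle`, `split_at_median`, `merge`, and `count_records` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bucket:
    """Hypothetical bucket owning the half-open hash range [start, end)."""
    start: int
    end: int


def count_records(bucket: Bucket, hashes: List[int]) -> int:
    """Count the record hashes that fall inside the bucket's range."""
    return sum(1 for h in hashes if bucket.start <= h < bucket.end)


def split_at_middle(bucket: Bucket) -> Tuple[Bucket, Bucket]:
    """Simple policy: cut the hash range in half, ignoring where records fall."""
    if bucket.end - bucket.start < 2:
        raise ValueError("hash range is too small to split")
    mid = bucket.start + (bucket.end - bucket.start) // 2
    return Bucket(bucket.start, mid), Bucket(mid, bucket.end)


def split_at_median(bucket: Bucket, hashes: List[int]) -> Tuple[Bucket, Bucket]:
    """Distribution-aware policy: cut at the median record hash.

    Needs the hash values of the records in the bucket, which the
    simple policy does not.
    """
    if bucket.end - bucket.start < 2:
        raise ValueError("hash range is too small to split")
    inside = sorted(h for h in hashes if bucket.start <= h < bucket.end)
    if not inside:
        # No distribution to learn from: fall back to the simple policy.
        return split_at_middle(bucket)
    # Keep the split point strictly inside the range so that neither
    # child ends up with an empty hash range.
    point = max(inside[len(inside) // 2], bucket.start + 1)
    return Bucket(bucket.start, point), Bucket(point, bucket.end)


def merge(left: Bucket, right: Bucket) -> Bucket:
    """Merge two buckets only if their hash ranges are adjacent."""
    if left.end != right.start:
        raise ValueError("buckets are not adjacent; merged range would be fragmented")
    return Bucket(left.start, right.end)


if __name__ == "__main__":
    parent = Bucket(0, 1000)
    skewed = [900, 910, 920, 930, 940, 950]  # every record hashes into the upper half

    low, high = split_at_middle(parent)
    print(count_records(low, skewed), count_records(high, skewed))  # prints "0 6"

    low, high = split_at_median(parent, skewed)
    print(count_records(low, skewed), count_records(high, skewed))  # prints "3 3"
```

The skewed input reproduces the extreme case mentioned in the description: the middle split leaves one child with all records and the other with none, while the median split balances them at the cost of scanning the record hashes. A further middle split of the full child would begin to rebalance it, which is the convergence argument made for the simple policy.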
> A pluggable implementation will nevertheless be kept for extensibility, so
> that users can choose among the available policies.
> All updates to the hash metadata will first be recorded in the clustering
> plan, and then be reflected in the partitions' hashing metadata when
> clustering finishes. The plan is generated and persisted in files during the
> scheduling process, which is protected by a table-level lock to ensure a
> consistent table view.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)