[ https://issues.apache.org/jira/browse/HUDI-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu updated HUDI-5560:
-----------------------------
    Epic Link: HUDI-3000

> Make Consistent hash index Bucket Resizing more available on real cases
> ------------------------------------------------------------------------
>
>                 Key: HUDI-5560
>                 URL: https://issues.apache.org/jira/browse/HUDI-5560
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: index
>            Reporter: Jian Feng
>            Priority: Major
>
> Let's take a look at the [Consistent Hash Index RFC: Bucket Resizing|https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md#bucket-resizing-splitting--merging]. I copy the bucket resizing part below:
>
> {panel:title=*Bucket Resizing (Splitting & Merging)*}
> Considering there is a semantic similarity between bucket resizing and clustering (i.e., re-organizing small data files), this proposal plans to integrate the resizing process as a subtask into the clustering service. The trigger condition for resizing directly depends on the file size, where small files will be merged and large files will be split.
>
> For merging files, we require that the buckets be adjacent to each other in terms of their hash ranges so that the output bucket has only one continuous hash range. Though this is not required in a standard Consistent Hashing algorithm, fragmentation in hash ranges may cause extra complexity for the splitting process in our case.
>
> For splitting files, a split point (i.e., hash ranges for the output buckets) should be decided:
> * A simple policy would be to split at the middle of the range, but it may produce uneven data files. In an extreme case, splitting may produce one file with all the data and one file with no data.
> * Another policy is to find a split point that evenly dispatches records into the children buckets. It requires knowledge of the hash value distribution of the original buckets.
>
> *In our implementation, we will first stick to the simple one, as buckets will finally converge to a balanced distribution after multiple rounds of resizing. Of course, a pluggable implementation will be kept for extensibility so that users can choose different available policies.*
>
> All updates related to the hash metadata will first be recorded in the clustering plan, and then be reflected in the partitions' hashing metadata when clustering finishes. The plan is generated and persisted in files during the scheduling process, which is protected by a table-level lock for a consistent table view.
> {panel}
>
> As described, I also checked the code in the master branch: it uses the first policy, which produces uneven data files at first, with buckets only converging to a balanced distribution after multiple rounds of resizing. But when I used this policy in a production environment, I found it causes OOM issues very often, since compaction cannot compact very big files with a huge number of record keys. Users also cannot read such a MergeOnRead table with uneven data files on Spark or Presto (currently, the consistent hash index cannot be used on COW tables).
>
> Is there any progress on the second policy? IMO, a better split point should be chosen before the uneven files are written. A sketch contrasting the two policies follows at the end of this message.
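>
> To make the contrast between the two split-point policies concrete, here is a minimal, self-contained Java sketch. The names (SplitPointPolicy, MidRangePolicy, MedianPolicy) are illustrative assumptions, not Hudi's actual resizing API:
> {code:java}
> import java.util.Arrays;
>
> // Hypothetical pluggable policy interface; Hudi's real API differs.
> interface SplitPointPolicy {
>   /** Picks the hash value at which a bucket owning the range [start, end) is split. */
>   int chooseSplitPoint(int start, int end, int[] recordHashes);
> }
>
> // Policy 1 (what master uses today): ignore the data, cut the range in half.
> class MidRangePolicy implements SplitPointPolicy {
>   @Override
>   public int chooseSplitPoint(int start, int end, int[] recordHashes) {
>     return start + (end - start) / 2;
>   }
> }
>
> // Policy 2 (the one asked about): cut at the median record hash so each
> // child bucket receives roughly half of the records. This requires the
> // hash value distribution of the bucket, e.g. collected while reading the file.
> class MedianPolicy implements SplitPointPolicy {
>   @Override
>   public int chooseSplitPoint(int start, int end, int[] recordHashes) {
>     int[] sorted = recordHashes.clone();
>     Arrays.sort(sorted);
>     return sorted[sorted.length / 2];
>   }
> }
>
> public class SplitPolicyDemo {
>   public static void main(String[] args) {
>     // A skewed bucket owning hash range [0, 1000): every record hashes below 60.
>     int[] hashes = {3, 7, 12, 15, 21, 30, 42, 55};
>     // Mid-range split point is 500: the left child keeps all 8 records, the right child is empty.
>     System.out.println(new MidRangePolicy().chooseSplitPoint(0, 1000, hashes));
>     // Median split point is 21: each child receives 4 records.
>     System.out.println(new MedianPolicy().chooseSplitPoint(0, 1000, hashes));
>   }
> }
> {code}
> On a skewed bucket like this, the mid-range policy reproduces the "one file with all data, one file with no data" extreme described in the RFC, while the median policy yields balanced children in a single split.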