[ https://issues.apache.org/jira/browse/HUDI-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-5560:
-----------------------------
    Epic Link: HUDI-3000

> Make consistent hash index bucket resizing more usable in real-world cases
> ---------------------------------------------------------------------------
>
>                 Key: HUDI-5560
>                 URL: https://issues.apache.org/jira/browse/HUDI-5560
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: index
>            Reporter: Jian Feng
>            Priority: Major
>
> Let's take a look at the [Consistent Hash Index RFC: Bucket Resizing|https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md#bucket-resizing-splitting--merging].
> I copy the bucket-resizing part below:
> {panel:title=*Bucket Resizing (Splitting & Merging)*}
> Considering there is a semantic similarity between bucket resizing and 
> clustering (i.e., re-organizing small data files), this proposal plans to 
> integrate the resizing process as a subtask into the clustering service. The 
> trigger condition for resizing directly depends on the file size, where small 
> files will be merged and large files will be split.
> For merging files, we require that the buckets should be adjacent to each 
> other in terms of their hash ranges so that the output bucket has only one 
> continuous hash range. Though it is not required in a standard Consistent 
> Hashing algorithm, fragmentations in hash ranges may cause extra complexity 
> for the splitting process in our case.
> For splitting files, a split point (i.e., hash ranges for the output buckets) 
> should be decided:
> * A simple policy would be to split at the middle of the range, but it may produce 
> uneven data files. In an extreme case, splitting may produce one file with 
> all data and one file with no data.
> * Another policy is to find a split point that evenly dispatches records into 
> children buckets. It requires knowledge about the hash value distribution of 
> the original buckets.
> *In our implementation, we will first stick to the first simple one, as 
> buckets will finally converge to a balanced distribution after multiple 
> rounds of resizing processes. Of course, a pluggable implementation will be 
> kept for extensibility so that users can choose different available policies.*
> All updates related to the hash metadata will be first recorded in the 
> clustering plan, and then be reflected in partitions' hashing metadata when 
> clustering finishes. The plan is generated and persisted in files during the 
> scheduling process, which is protected by a table-level lock for a consistent 
> table view.
> {panel}
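> For illustration, a minimal sketch of the first (mid-point) policy could look 
> like this. All names here are hypothetical, not the actual Hudi classes:
> {code:java}
> // Hypothetical sketch of the mid-point split policy described in the RFC.
> final class BucketRange {
>     final long start; // exclusive lower bound of the bucket's hash range
>     final long end;   // inclusive upper bound of the bucket's hash range
> 
>     BucketRange(long start, long end) {
>         this.start = start;
>         this.end = end;
>     }
> }
> 
> final class MidPointSplitPolicy {
>     /** Split a bucket's hash range at its middle, producing two children. */
>     static BucketRange[] split(BucketRange bucket) {
>         // Overflow-safe midpoint; hash values are non-negative,
>         // so (end - start) cannot overflow.
>         long mid = bucket.start + (bucket.end - bucket.start) / 2;
>         return new BucketRange[] {
>             new BucketRange(bucket.start, mid),
>             new BucketRange(mid, bucket.end)
>         };
>     }
> }
> {code}
> Note that splitting the hash range evenly does not split the data evenly: if 
> the record hashes cluster in one half of the range, one child bucket receives 
> nearly all of the records.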
> As described, I also checked the code in the master branch: it uses the first 
> policy, which produces uneven data files at first, and buckets only converge 
> to a balanced distribution after multiple rounds of resizing. But when I used 
> this policy in a production environment, I found it causes OOM issues very 
> often, since compaction cannot compact very big files with a huge number of 
> record keys. Users also cannot read such a MergeOnRead table with uneven data 
> files on Spark or Presto (currently the consistent hash index cannot be used 
> on COW tables).
> Is there any progress on the second policy? IMO, the split point should be 
> decided before the uneven files are written.
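> For reference, a hedged sketch of what the second policy might look like, 
> assuming the record hash values of a bucket can be collected or sampled 
> before the split (all names are hypothetical, not the Hudi API):
> {code:java}
> import java.util.Arrays;
> 
> // Hypothetical sketch: choose a split point from the observed hash
> // distribution so each child bucket receives roughly half the records.
> final class EvenSplitPolicy {
>     /**
>      * @param recordHashes non-empty array of hash values of the records
>      *                     currently in the bucket (e.g. sampled while
>      *                     reading the file group)
>      * @return a split point such that about half the records fall on
>      *         each side
>      */
>     static long chooseSplitPoint(long[] recordHashes) {
>         long[] sorted = recordHashes.clone();
>         Arrays.sort(sorted);
>         // Median hash: records <= median go to the left child,
>         // records > median go to the right child.
>         return sorted[sorted.length / 2];
>     }
> }
> {code}
> In practice the distribution could be approximated from a sample or a 
> quantile sketch collected during writes, so the split point can be decided 
> without a full scan of the file group.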



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
