hudi-bot opened a new issue, #15692:
URL: https://github.com/apache/hudi/issues/15692

   Let's take a look at the [Consistent Hash Index RFC: Bucket Resizing](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md#bucket-resizing-splitting--merging) section.
   I copy the bucket-resizing part below:
   
   > **Bucket Resizing (Splitting & Merging)**
   >
   > Considering that there is a semantic similarity between bucket resizing and clustering (i.e., re-organizing small data files), this proposal plans to integrate the resizing process as a subtask of the clustering service. The trigger condition for resizing depends directly on file size: small files will be merged and large files will be split.
   >
   > For merging files, we require that the buckets be adjacent to each other in terms of their hash ranges, so that the output bucket has only one continuous hash range. Though this is not required by a standard consistent hashing algorithm, fragmentation in hash ranges would add extra complexity to the splitting process in our case.
   >
   > For splitting files, a split point (i.e., the hash ranges for the output buckets) must be decided:
   > * A simple policy is to split at the middle of the range, but this may produce uneven data files. In an extreme case, splitting may produce one file with all the data and one file with no data.
   > * Another policy is to find a split point that dispatches records evenly into the child buckets. This requires knowledge of the hash value distribution within the original bucket.
   >
   > **In our implementation, we will first stick to the simple policy, as buckets will eventually converge to a balanced distribution after multiple rounds of resizing. Of course, a pluggable implementation will be kept for extensibility so that users can choose among the available policies.**
   >
   > All updates related to the hash metadata will first be recorded in the clustering plan, and then reflected in the partitions' hashing metadata when clustering finishes. The plan is generated and persisted to files during the scheduling process, which is protected by a table-level lock to keep a consistent table view.
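   To make the two quoted operations concrete, here is a minimal, self-contained sketch of midpoint splitting and adjacent-range merging. The `HashRange` model and method names are hypothetical, for illustration only, not Hudi's actual classes:

   ```java
   import java.util.Arrays;

   // Hypothetical model of bucket hash ranges; not Hudi's actual API.
   public final class BucketResizeSketch {

     /** A bucket's hash range: [start, end), end exclusive. */
     record HashRange(int start, int end) { }

     /** First policy: split at the middle of the hash range. Simple, but the
      *  two children may receive very different numbers of records if the
      *  hash values within the range are skewed. */
     static HashRange[] splitAtMidpoint(HashRange bucket) {
       int mid = bucket.start() + (bucket.end() - bucket.start()) / 2;
       return new HashRange[] {
           new HashRange(bucket.start(), mid),
           new HashRange(mid, bucket.end())
       };
     }

     /** Merging requires adjacency, so that the output bucket covers one
      *  continuous hash range, as the RFC mandates. */
     static HashRange mergeAdjacent(HashRange left, HashRange right) {
       if (left.end() != right.start()) {
         throw new IllegalArgumentException("buckets must be adjacent in hash range");
       }
       return new HashRange(left.start(), right.end());
     }

     public static void main(String[] args) {
       HashRange bucket = new HashRange(0, 1 << 20);
       HashRange[] children = splitAtMidpoint(bucket);
       System.out.println(Arrays.toString(children));              // two equal sub-ranges
       System.out.println(mergeAdjacent(children[0], children[1])); // back to the original range
     }
   }
   ```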
   
   As described, I also checked the code in the master branch: it uses the first policy, which initially produces uneven data files, on the expectation that buckets will eventually converge to a balanced distribution after multiple rounds of resizing. But when I used this policy in a production environment, I found it causes OOM issues very often, since compaction cannot compact very big files with a huge number of record keys. Users also cannot read such a Merge-on-Read table with uneven data files on Spark or Presto (currently the consistent hash index cannot be used on COW tables).
   
   Is there any progress on the second policy? IMO, a better split point should be chosen before the uneven files are written.
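
   For reference, the second policy could look something like the following sketch: pick the split point from the actual hash values of the records in the bucket (here, the median), so that each child receives roughly half of the records. All names are hypothetical, and a real implementation would presumably sample hash values rather than sort them all:

   ```java
   import java.util.Arrays;

   // Hypothetical sketch of the distribution-aware split policy; not Hudi code.
   public final class EvenSplitPolicySketch {

     /** Returns a split point such that about half of the given record hash
      *  values fall below it (the median). */
     static int medianSplitPoint(int[] recordHashes) {
       int[] sorted = recordHashes.clone();
       Arrays.sort(sorted);
       return sorted[sorted.length / 2];
     }

     public static void main(String[] args) {
       // Skewed hash values in a bucket covering [0, 100): a midpoint split at 50
       // would put 9 of 10 records into the lower child; the median split does not.
       int[] hashes = {1, 2, 3, 4, 5, 6, 7, 8, 9, 90};
       System.out.println("median split point: " + medianSplitPoint(hashes)); // prints 6
     }
   }
   ```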
   
   
   
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-5560
   - Type: Improvement
   - Epic: https://issues.apache.org/jira/browse/HUDI-3000

