GitHub user cshuo edited a comment on the discussion: Dynamic Bucket Index For 
Flink streaming

> 1. For the small-file profile used to assign new keys to existing buckets, 
> there are two candidate metrics: row count and file size (file group/base 
> file). Let's decide which one we want here. We also need a way to calculate 
> or estimate the values.

We already have `BucketAssigner`, which assigns buckets based on small-file 
profiling and calculates a target maximum row count for each bucket as 
`parquetMaxFileSize / avgRecordSize`. For the first version of the dynamic 
bucket index, I think we can reuse the same profiling logic.
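
To make the arithmetic concrete, here is a minimal Python sketch of that calculation. It is not Hudi's `BucketAssigner` code: the function names are hypothetical, the `remaining_rows` helper is my own assumption about how the target could be used, and the 120 MiB / 1 KiB numbers are example inputs only.

```python
def target_max_rows(parquet_max_file_size: int, avg_record_size: int) -> int:
    """Target maximum row count per bucket: max file size / average record size."""
    if avg_record_size <= 0:
        raise ValueError("avg_record_size must be positive")
    # Integer division: a partial record does not count as a row.
    return parquet_max_file_size // avg_record_size


def remaining_rows(current_rows: int, parquet_max_file_size: int,
                   avg_record_size: int) -> int:
    """Hypothetical helper: rows a bucket can still take before hitting the target."""
    target = target_max_rows(parquet_max_file_size, avg_record_size)
    return max(0, target - current_rows)


# Example inputs (assumed): 120 MiB max file size, 1 KiB average record size.
max_file_size = 120 * 1024 * 1024
print(target_max_rows(max_file_size, 1024))
print(remaining_rows(100_000, max_file_size, 1024))
```

A bucket whose `remaining_rows` is zero would be treated as full, so new keys would go to another bucket or a newly created one.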

> 2. When reading the partitioned RLI for a specific partition, is there any 
> read amplification? For example, are the partition's index mappings scattered 
> among multiple buckets, or stored together with other partitions within one 
> RLI bucket?

For partitioned RLI, mappings are organized by data partition. They are not 
mixed together with mappings from other data partitions. The file group ID 
naming for partitioned RLI is:
`record-index-<escapedDataPartitionName>-<4-digit fileGroupIndex>-0`
Concretely:
* Each data partition owns its own partitioned-RLI file group set.
* By default, one data partition uses 1 RLI file group; this can be configured 
to a larger value if necessary.
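
As an illustration of the naming pattern above, a small Python sketch follows. It only formats the ID string. The function names are hypothetical, the partition name is assumed to be already escaped (the escaping rule is not shown here), and starting the file group index at 0 is an assumption.

```python
def rli_file_group_id(escaped_partition: str, file_group_index: int) -> str:
    """Format record-index-<escapedDataPartitionName>-<4-digit fileGroupIndex>-0."""
    if not 0 <= file_group_index <= 9999:
        raise ValueError("file_group_index must fit in 4 digits")
    return f"record-index-{escaped_partition}-{file_group_index:04d}-0"


def rli_file_group_ids(escaped_partition: str, file_group_count: int = 1) -> list[str]:
    """All RLI file group IDs owned by one data partition (default: 1 file group)."""
    # Assumption: indexes are numbered from 0.
    return [rli_file_group_id(escaped_partition, i) for i in range(file_group_count)]


print(rli_file_group_ids("2024_01_01"))
print(rli_file_group_ids("2024_01_01", file_group_count=2))
```

Because every ID embeds the data partition name, a lookup for one partition only needs to touch that partition's own file groups.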

GitHub link: 
https://github.com/apache/hudi/discussions/18514#discussioncomment-16632411
