[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

GitBox Sun, 14 Nov 2021 18:49:27 -0800


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-968469713



   Hi @nsivabalan, I've addressed all comments. The main changes are:
   1. Unified the bucket index configurations under HoodieIndexConfig.
   2. On the premise that the bucket index key must be a subset of the record 
key, the index key value is extracted at runtime from HoodieKey without 
changing the data structure. `BucketIdentifier` is introduced to do this.
   3. During `tag location`, the partial filesystem view is cached in each Spark 
task. This differs from the bloom index, which first caches the hoodie key 
and file name and then joins them with the input data. The bucket index is 
intended for larger datasets, where a join is a heavy operation. Therefore, 
going from hoodieRecordRDD to taggedRecordRDD is a mapPartition-only operation.
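
   Points 2 and 3 can be sketched roughly as follows. This is only an 
illustration in Python, not the code in the PR: the `field:value,field:value` 
record key layout, the CRC32 hash, and the names `parse_record_key`, 
`bucket_id`, `tag_partition`, and `bucket_to_file` are all assumptions made 
for the example.

```python
import zlib
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def parse_record_key(record_key: str) -> Dict[str, str]:
    """Split a composite key such as 'id:1,region:eu' into {field: value}.

    The 'field:value,field:value' layout is an assumption for this sketch.
    """
    return dict(part.split(":", 1) for part in record_key.split(","))


def bucket_id(record_key: str, index_fields: List[str], num_buckets: int) -> int:
    """Derive the bucket from the index fields only (a subset of the record key)."""
    fields = parse_record_key(record_key)
    # Raises KeyError if an index field is not part of the record key.
    values = [fields[name] for name in index_fields]
    # CRC32 is a stand-in for whatever hash the real index uses.
    digest = zlib.crc32("\x00".join(values).encode("utf-8"))
    return digest % num_buckets


def tag_partition(
    records: Iterable[Tuple[str, str]],
    bucket_to_file: Dict[Tuple[str, int], str],
    index_fields: List[str],
    num_buckets: int,
) -> Iterator[Tuple[str, str, Optional[str]]]:
    """mapPartitions-style tagging: a per-task lookup, with no join.

    `records` yields (partition_path, record_key) pairs. `bucket_to_file`
    stands for the partial filesystem view cached in the task, mapping
    (partition_path, bucket) to an existing file id.
    """
    for partition_path, record_key in records:
        bucket = bucket_id(record_key, index_fields, num_buckets)
        # None means no file exists for this bucket yet.
        yield partition_path, record_key, bucket_to_file.get((partition_path, bucket))
```

   Because the bucket depends only on the index fields, two records that 
share those fields land in the same bucket regardless of the rest of the 
record key, so each task can tag its records from its cached view alone.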


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

