[GitHub] [hudi] minihippo edited a comment on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

GitBox Sun, 14 Nov 2021 22:26:45 -0800


minihippo edited a comment on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-968469713



   Hi @nsivabalan, I've addressed all the comments. The main changes are:
   1. Unify the bucket index configurations under HoodieIndexConfig.
   2. On the premise that the bucket index key must be a subset of the record 
key, extract the index key value at runtime from the HoodieKey, using a 
workaround that leaves the data structure unchanged. `BucketIdentifier` is 
introduced to do this.
   3. During `tag location`, cache a partial filesystem view in each Spark 
task. This differs from the bloom index, which first caches the hoodie keys 
and file names and then joins them with the input data. The bucket index is 
intended for larger data sets, where a join is a heavy operation. Therefore, 
the hoodieRecordRDD to taggedRecordRDD step is a mapPartition-only operation.
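
   To illustrate points 2 and 3, here is a minimal sketch in plain Java. It 
is not the code in this PR. It assumes the record key is a string of the form 
`field1:value1,field2:value2`, and the names `BucketIndexSketch`, 
`indexKeyValues`, `bucketId` and `tagPartition` are hypothetical. The hash is 
Java's `List.hashCode()`, chosen only for illustration and not claimed to 
match Hive's bucketing hash. The per-partition map from bucket id to file id 
stands in for the cached partial filesystem view.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical, not Hudi APIs.
class BucketIndexSketch {

    // Parse a record key of the ASSUMED form "f1:v1,f2:v2" and return the
    // values of the requested index key fields, in the order requested.
    static List<String> indexKeyValues(String recordKey, List<String> indexKeyFields) {
        Map<String, String> fieldToValue = new HashMap<>();
        for (String pair : recordKey.split(",")) {
            String[] kv = pair.split(":", 2);
            if (kv.length == 2) {
                fieldToValue.put(kv[0], kv[1]);
            }
        }
        List<String> values = new ArrayList<>();
        for (String field : indexKeyFields) {
            String value = fieldToValue.get(field);
            if (value == null) {
                // The index key must be a subset of the record key fields.
                throw new IllegalArgumentException("Index key field not in record key: " + field);
            }
            values.add(value);
        }
        return values;
    }

    // Map a record key to a bucket in [0, numBuckets). The hash function is
    // an assumption made for this sketch.
    static int bucketId(String recordKey, List<String> indexKeyFields, int numBuckets) {
        int hash = indexKeyValues(recordKey, indexKeyFields).hashCode();
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // Tag every record key of one partition using only a local lookup table
    // (bucket id -> file id). No join or shuffle is involved, which is the
    // shape a mapPartition-style function would have.
    static List<String> tagPartition(Iterator<String> recordKeys,
                                     Map<Integer, String> bucketToFileId,
                                     List<String> indexKeyFields,
                                     int numBuckets) {
        List<String> tagged = new ArrayList<>();
        while (recordKeys.hasNext()) {
            String key = recordKeys.next();
            int bucket = bucketId(key, indexKeyFields, numBuckets);
            // Buckets without an existing file are marked as new inserts.
            String fileId = bucketToFileId.getOrDefault(bucket, "<new>");
            tagged.add(key + " -> " + fileId);
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<String> indexKeyFields = Arrays.asList("id");
        int numBuckets = 8;
        Map<Integer, String> bucketToFileId = new HashMap<>();
        bucketToFileId.put(bucketId("id:1,ts:100", indexKeyFields, numBuckets), "file-a");
        List<String> keys = Arrays.asList("id:1,ts:100", "id:1,ts:200");
        for (String line : tagPartition(keys.iterator(), bucketToFileId, indexKeyFields, numBuckets)) {
            System.out.println(line);
        }
    }
}
```

   Because the bucket depends only on the index key fields, two records that 
share `id` but differ in `ts` land in the same bucket, and tagging needs no 
join against previously written keys.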


