minihippo edited a comment on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-968469713


   Hi @nsivabalan, I've addressed all comments. The main changes are:
   1. Unify the bucket index configurations under HoodieIndexConfig.
   2. On the premise that the bucket index key must be a subset of the record 
key, extract the index key value at runtime from the HoodieKey, using a small 
trick that leaves the data structure unchanged. `BucketIdentifier` is 
introduced to do this.
   3. During `tag location`, cache the partial filesystem view in each Spark 
task. This differs from the bloom index, which first caches the hoodie keys 
and file names and then joins them with the input data. The bucket index is 
intended for larger data sets, and a join is a heavy operation. Therefore, 
going from hoodieRecordRDD to taggedRecordRDD is a mapPartition-only operation.
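
To illustrate points 2 and 3, here is a minimal, self-contained sketch of the idea. It is not the code in this PR. It assumes the record key is a string of `field:value` pairs separated by commas, and every name in it (`BucketSketch`, `indexKeyValues`, `bucketId`, `tagPartition`, the `bucketToFileId` map) is hypothetical. The per-partition loop stands in for what a mapPartition function would do with a cached bucket-to-file view, with no join involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and key format are assumptions, not Hudi APIs.
class BucketSketch {

    // Assumed record key layout: "id:42,region:us".
    // Returns the values of the index fields, in the order the fields are given.
    static List<String> indexKeyValues(String recordKey, List<String> indexFields) {
        Map<String, String> fields = new HashMap<>();
        for (String part : recordKey.split(",")) {
            String[] kv = part.split(":", 2);
            if (kv.length == 2) {
                fields.put(kv[0], kv[1]);
            }
        }
        List<String> values = new ArrayList<>();
        for (String field : indexFields) {
            values.add(fields.get(field)); // null if the field is absent
        }
        return values;
    }

    // Hash the index key values into one of numBuckets buckets.
    // floorMod keeps the result non-negative for negative hash codes.
    static int bucketId(String recordKey, List<String> indexFields, int numBuckets) {
        int hash = indexKeyValues(recordKey, indexFields).hashCode();
        return Math.floorMod(hash, numBuckets);
    }

    // Tag every record key of one partition with a file id, using only a
    // per-task cached map from bucket id to file id. A null file id means
    // the bucket has no file yet in the cached view.
    static Map<String, String> tagPartition(Iterator<String> recordKeys,
                                            Map<Integer, String> bucketToFileId,
                                            List<String> indexFields,
                                            int numBuckets) {
        Map<String, String> tagged = new LinkedHashMap<>();
        while (recordKeys.hasNext()) {
            String key = recordKeys.next();
            int bucket = bucketId(key, indexFields, numBuckets);
            tagged.put(key, bucketToFileId.get(bucket));
        }
        return tagged;
    }
}
```

Because the bucket id depends only on the record key and the bucket count, each task can resolve locations independently from its cached view, which is why no shuffle is needed in this sketch.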


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


