minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-968469713
Hi @nsivabalan, I've addressed all the comments. The main changes are:

1. Unified the bucket index configurations under `HoodieIndexConfig`.
2. On the premise that the bucket index key must be a subset of the record key, the index key value is now extracted at runtime from the `HoodieKey`, without breaking the existing data structure. `BucketIdentifier` is introduced to do this.
3. During the `tag location` step, the partial filesystem view is cached in each Spark task. This differs from the bloom index, which first caches the hoodie keys and file names and then joins them with the input data. The bucket index is intended for much larger data sets, where a join is a heavy operation. Therefore, going from hoodieRecordRDD to taggedRecordRDD is a `mapPartitions`-only operation.
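
To make points 2 and 3 concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: the class `BucketIdSketch`, its method names, the `field:value,field:value` record key layout, and the use of `List.hashCode()` as the hash function are all assumptions made for illustration. It shows how a bucket id could be derived from only the index key fields of a record key, and how the records of one partition could be tagged against a locally cached bucket-to-file mapping without a join.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the BucketIdentifier class from the PR.
class BucketIdSketch {

    // Parse a record key of the ASSUMED form "field1:value1,field2:value2".
    static Map<String, String> parseRecordKey(String recordKey) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : recordKey.split(",")) {
            String[] kv = pair.split(":", 2);
            fields.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return fields;
    }

    // Hash only the index key fields (a subset of the record key fields)
    // and map the result onto [0, numBuckets).
    static int bucketId(String recordKey, List<String> indexKeyFields, int numBuckets) {
        Map<String, String> fields = parseRecordKey(recordKey);
        List<String> values = new ArrayList<>();
        for (String field : indexKeyFields) {
            String value = fields.get(field);
            if (value == null) {
                throw new IllegalArgumentException(
                    "Index key field '" + field + "' is not part of record key: " + recordKey);
            }
            values.add(value);
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(values.hashCode(), numBuckets);
    }

    // Tag the records of ONE partition using a locally cached bucket -> file id map.
    // This is a single pass over the iterator (the shape of a mapPartitions function);
    // no join is involved. A null file id means the bucket has no file yet.
    static Map<String, String> tagPartition(Iterator<String> recordKeys,
                                            Map<Integer, String> bucketToFileId,
                                            List<String> indexKeyFields,
                                            int numBuckets) {
        Map<String, String> tagged = new LinkedHashMap<>();
        while (recordKeys.hasNext()) {
            String recordKey = recordKeys.next();
            int bucket = bucketId(recordKey, indexKeyFields, numBuckets);
            tagged.put(recordKey, bucketToFileId.get(bucket));
        }
        return tagged;
    }
}
```

In this sketch, two records that share the same index key value (for example `id:42`) always land in the same bucket regardless of the other record key fields, so the location lookup can be a plain map access inside each task.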