[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-1003084971

> @minihippo I was thinking we can name all parameters `hoodie.storage.layout..` instead, but the space curve PRs are all named `hoodie.layout.optimize` anyway. So I think it's OK.

I didn't modify `hoodie.layout.optimize` directly, to keep compatibility with historical configs.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-1002609673 @hudi-bot run azure
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-1002355744 @vinothchandar I addressed all comments, and the failing unit test is not related to this PR. Can we land this?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-996824527

The main changes are:

1. Introduce the `layout` entry to constrain the write behavior.
2. Remove the hash function abstraction and use the JVM hashcode instead, to keep it simple.
3. Remove the changes to Spark `MergeOnReadRDD`.
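
A minimal sketch of what change 2 could look like, written in Python for brevity. It emulates `java.lang.String.hashCode()` and maps a key to a bucket. The function names, the sign-bit mask, and the modulo step are assumptions made for illustration; the comment above only says that the JVM hashcode is used, so this is not necessarily how the PR computes the bucket.

```python
def java_string_hash(s: str) -> int:
    """Emulate java.lang.String.hashCode() as a signed 32-bit integer."""
    h = 0
    # Java hashes UTF-16 code units, so iterate over them explicitly.
    data = s.encode("utf-16-be")
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # wrap like a Java int
    # Reinterpret the unsigned 32-bit value as signed.
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_id(key: str, num_buckets: int) -> int:
    """Map a key to a bucket in [0, num_buckets).

    Assumption: the sign bit is masked off before the modulo so that
    negative hash codes still land in a valid bucket.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return (java_string_hash(key) & 0x7FFFFFFF) % num_buckets
```

Because the mapping depends only on the key and the bucket count, the same key always lands in the same bucket, which is what lets a writer locate a record without a lookup structure.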
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-984680149

Progress update:

- Support consecutive insertions

cc @YuweiXiao @leesf
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-976190383

> Hi @nsivabalan, I've fixed all comments. The main changes are:
>
> 1. Unify the bucket index configurations into `HoodieIndexConfig`.
> 2. Given that the bucket index key must be a subset of the record key, get the index key value at runtime from `HoodieKey`, in a somewhat tricky way, without changing the data structure. `BucketIdentifier` is introduced to do this.
> 3. During `tag location`, cache a partial filesystem view in each Spark task. This differs from the bloom index, which first caches the hoodie keys and file names and then joins them with the input data. The bucket index is intended for much larger data sets, where a join is a heavy operation. Therefore, going from hoodieRecordRDD to taggedRecordRDD is a mapPartition-only operation.

@vinothchandar, here is the summary after all comments were addressed.
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-976189582

> @minihippo Wondering where we are on this. We can get this in to 0.10 if the changes are mostly isolated. Let me know.

@vinothchandar thanks for replying. Does `isolated` mean that the bucket index does not affect basic functions or other features? Whether it is used is controlled by a switch. I have addressed all comments from @nsivabalan, but I'm not sure whether the changes are acceptable to the community.
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-970346845

> Hi @nsivabalan, I've fixed all comments. The main changes are:
>
> 1. Unify the bucket index configurations into `HoodieIndexConfig`.
> 2. Given that the bucket index key must be a subset of the record key, get the index key value at runtime from `HoodieKey`, in a somewhat tricky way, without changing the data structure. `BucketIdentifier` is introduced to do this.
> 3. During `tag location`, cache a partial filesystem view in each Spark task. This differs from the bloom index, which first caches the hoodie keys and file names and then joins them with the input data. The bucket index is intended for much larger data sets, where a join is a heavy operation. Therefore, going from hoodieRecordRDD to taggedRecordRDD is a mapPartition-only operation.

Friendly ping @nsivabalan, could you please take a look when you have time?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-969886594 @hudi-bot run azure
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-968469713

Hi @nsivabalan, I've fixed all comments. The main changes are:

1. Unify the bucket index configurations into `HoodieIndexConfig`.
2. Given that the bucket index key must be a subset of the record key, get the index key value at runtime from `HoodieKey` without changing the data structure. `BucketIdentifier` is introduced to do this.
3. During `tag location`, cache a partial filesystem view in each Spark task. This differs from the bloom index, which first caches the hoodie keys and file names and then joins them with the input data. The bucket index is intended for much larger data sets, where a join is a heavy operation. Therefore, going from hoodieRecordRDD to taggedRecordRDD is a mapPartition-only operation.
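
Points 2 and 3 above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the record key layout (`field1:value1,field2:value2`), the function names, and the way the index key values are joined before hashing are all hypothetical and are not taken from the PR. The sketch shows the shape of the idea: the index key is derived from the record key on the fly, and each partition is tagged with a local dictionary lookup instead of a join.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def java_string_hash(s: str) -> int:
    """Emulate java.lang.String.hashCode() as a signed 32-bit integer."""
    h = 0
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    for i in range(0, len(data), 2):
        h = (31 * h + ((data[i] << 8) | data[i + 1])) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def extract_index_key(record_key: str, index_fields: List[str]) -> List[str]:
    """Pull the index key values out of a record key.

    Assumed record key layout (hypothetical): "field1:value1,field2:value2".
    The index fields must be a subset of the record key fields.
    """
    pairs = dict(part.split(":", 1) for part in record_key.split(","))
    return [pairs[field] for field in index_fields]


def bucket_id(record_key: str, index_fields: List[str], num_buckets: int) -> int:
    """Hash the index key values into a bucket in [0, num_buckets)."""
    # Assumption: values are joined with "," before hashing.
    joined = ",".join(extract_index_key(record_key, index_fields))
    return (java_string_hash(joined) & 0x7FFFFFFF) % num_buckets


def tag_partition(
    record_keys: Iterable[str],
    index_fields: List[str],
    num_buckets: int,
    bucket_to_file: Dict[int, str],
) -> Iterator[Tuple[str, Optional[str]]]:
    """Tag every record of one partition with its file id, without a join.

    bucket_to_file stands in for the partial filesystem view cached in the
    task; a record whose bucket has no file yet is tagged with None.
    """
    for key in record_keys:
        bucket = bucket_id(key, index_fields, num_buckets)
        yield key, bucket_to_file.get(bucket)
```

In Spark terms, a function shaped like `tag_partition` is what would be handed to a per-partition map, so no shuffle is needed to find the target file of a record.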
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-964215135 Hi @nsivabalan, thanks for the review. Based on the comments, one open question is where the index key should be kept: should it be deserialized each time it is used, or saved after the first deserialization?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-964132779

> Thanks for the contribution. Have left some comments. And IIUC, not all clustering strategies may sit well with this bucket index. Some clustering strategies are intended to create new file groups, which may not work out in the case of the bucket index, right? We should call out somewhere what the constraints are with this type of index, like ever-growing file sizes, no small file handling, etc.

Yes, you are right. Should I add a doc to the index config to explain the constraints?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-964128144

> @minihippo : While I am starting to review, curious to know if you got a chance to do any perf analysis against the existing bloom index.

In our case, an unpartitioned table with 4 file groups, over 30 TB, and more than 500 billion records cannot run successfully with the bloom index. Note that the Hudi version is 0.6.0.
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-953044953

> @minihippo yes. we can target this for 0.10.0 . Your take is that this should be ready for landing, after which we can do the follow ons?

What should I do before landing? For example, show that an e2e run passes, or anything else? Does it still need another round of code review before landing?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-951743238 Hi @vinothchandar @leesf, sorry for the long delay, which was due to personal reasons. I will focus more on the Hudi community from now on. Can we work together to speed up merging this patch into master so that it makes the 0.10.0 release?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-917882872 Hi all, I will fix all comments this week.
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-909949793

> @minihippo Will do. Apologies if I am slow in the next few days. Trying to push out the 0.9.0 RC.

Hi @vinothchandar, do you have any time to review this PR?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-905536202

> @minihippo would you please address the comments above?

Done.
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-900406209 Hi @vinothchandar, are there any comments?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-896477727 Hi @vinothchandar, can you take a look?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-11913 Hi @leesf, I will fix the unresolved comments in the next two days.
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-878023036 Hi @leesf, I think the patch is too large. Should I split it into two PRs to make it easier to review?
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-872866013 @vinothchandar done
[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket
minihippo commented on pull request #3173: URL: https://github.com/apache/hudi/pull/3173#issuecomment-872104738 The `ITTestHoodieDemo` failure is due to its own instability rather than being affected by this new feature, and HUDI-2113 has already fixed it.