[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-12-30 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-1003084971


   > @minihippo I was thinking we can name all parameters 
`hoodie.storage.layout..` instead, but the space curve PRs are all named 
`hoodie.layout.optimize` anyway. So I think it's ok
   
   I didn't modify `hoodie.layout.optimize` directly, to keep compatibility 
with existing configs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-12-29 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-1002609673


   @hudi-bot run azure






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-12-28 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-1002355744


   @vinothchandar I have addressed all comments, and the failing UT is not 
related to this PR. Can we land this?






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-12-17 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-996824527


   The main changes are:
   1. Introduce the `layout` entry to constrain the write behavior.
   2. Remove the hash function abstraction and use the JVM hashCode instead, to 
keep it simple.
   3. Remove the changes to Spark MergeOnReadRDD.
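
To illustrate change 2, here is a minimal sketch of mapping index key values to a bucket with the JVM `hashCode`. The names `BucketHashSketch` and `bucketId` are hypothetical and not taken from the PR. Masking the sign bit to keep the result non-negative is an assumption, not something this thread confirms.

```java
import java.util.Arrays;
import java.util.List;

public class BucketHashSketch {

    // Hypothetical helper: maps the index key values to a bucket in [0, numBuckets).
    // List.hashCode() and String.hashCode() are specified by the JDK, so the
    // result is stable across JVMs for string keys.
    public static int bucketId(List<String> indexKeyValues, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        int hash = indexKeyValues.hashCode();
        // Clear the sign bit so the modulo result is never negative (assumed approach).
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 8;
        for (String key : Arrays.asList("user-1", "user-2", "user-3")) {
            System.out.println(key + " -> bucket " + bucketId(Arrays.asList(key), numBuckets));
        }
    }
}
```

Because the bucket depends only on the key and the bucket count, the same record always lands in the same bucket as long as the bucket count does not change.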






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-12-02 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-984680149


   Progress Update:
   - support consecutive insertions  cc @YuweiXiao @leesf 






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-22 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-976190383


   > Hi @nsivabalan, I've fixed all comments. The main changes are:
   > 
   > 1. Unify the bucket index configurations in HoodieIndexConfig.
   > 2. On the premise that the bucket index key must be a subset of the record 
key, get the index key value at runtime from HoodieKey, using a trick that 
does not break the data structure. `BucketIdentifier` is introduced to do 
this.
   > 3. During `tag location`, cache the partial filesystem view in each Spark 
task. The implementation differs from the bloom index, which first caches the 
hoodie key and file name and then joins them with the input data. The bucket 
index is intended for larger datasets, where a join is a heavy operation. 
Therefore, hoodieRecordRDD to taggedRecordRDD is a mapPartition-only operation.
   
   @vinothchandar, here is the summary after addressing all comments.






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-22 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-976189582


   > @minihippo Wondering where we are on this. We can get this in to 0.10 if 
the changes are mostly isolated. let me know.
   
   @vinothchandar thanks for replying. Does `isolated` mean that the bucket 
index will not affect basic functions and other features? Whether it is used is 
controlled by a switch.
   I have addressed all comments from @nsivabalan, but I'm not sure whether 
the changes are acceptable to the community.






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-16 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-970346845


   > Hi @nsivabalan, I've fixed all comments. The main changes are:
   > 
   > 1. Unify the bucket index configurations in HoodieIndexConfig.
   > 2. On the premise that the bucket index key must be a subset of the record 
key, get the index key value at runtime from HoodieKey, using a trick that 
does not break the data structure. `BucketIdentifier` is introduced to do 
this.
   > 3. During `tag location`, cache the partial filesystem view in each Spark 
task. The implementation differs from the bloom index, which first caches the 
hoodie key and file name and then joins them with the input data. The bucket 
index is intended for larger datasets, where a join is a heavy operation. 
Therefore, hoodieRecordRDD to taggedRecordRDD is a mapPartition-only operation.
   
   friendly ping @nsivabalan,  could you please take a look in your spare time? 






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-15 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-969886594


   @hudi-bot run azure
   






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-14 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-968469713


   Hi @nsivabalan, I've fixed all comments. The main changes are:
   1. Unify the bucket index configurations in HoodieIndexConfig.
   2. On the premise that the bucket index key must be a subset of the record 
key, get the index key value at runtime from HoodieKey without breaking the 
data structure. `BucketIdentifier` is introduced to do this.
   3. During `tag location`, cache the partial filesystem view in each Spark 
task. The implementation differs from the bloom index, which first caches the 
hoodie key and file name and then joins them with the input data. The bucket 
index is intended for larger datasets, where a join is a heavy operation. 
Therefore, hoodieRecordRDD to taggedRecordRDD is a mapPartition-only operation.
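
Points 2 and 3 can be sketched in plain Java without Spark. This is an illustration under stated assumptions, not the PR's code: the class `BucketTagSketch` and its methods are hypothetical, the record key is assumed to have the form `field1:value1,field2:value2`, and a `Map<Integer, String>` stands in for the partial filesystem view cached in each task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BucketTagSketch {

    // Hypothetical: extract the index key values (a subset of the record key
    // fields) from a record key of the assumed form "f1:v1,f2:v2".
    public static List<String> indexKeyValues(String recordKey, List<String> indexKeyFields) {
        Map<String, String> fields = new HashMap<>();
        for (String pair : recordKey.split(",")) {
            String[] kv = pair.split(":", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Malformed record key: " + recordKey);
            }
            fields.put(kv[0], kv[1]);
        }
        List<String> values = new ArrayList<>();
        for (String field : indexKeyFields) {
            String value = fields.get(field);
            if (value == null) {
                throw new IllegalArgumentException("Index key field not in record key: " + field);
            }
            values.add(value);
        }
        return values;
    }

    // Bucket id from the JVM hashCode of the index key values (sign bit masked).
    public static int bucketId(String recordKey, List<String> indexKeyFields, int numBuckets) {
        int hash = indexKeyValues(recordKey, indexKeyFields).hashCode();
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // Tags every record key of one partition with a file id using only a local
    // lookup, so no join or shuffle is needed. A null value means the bucket has
    // no file group yet, i.e. the record is an insert.
    public static Map<String, String> tagPartition(List<String> recordKeys,
                                                   List<String> indexKeyFields,
                                                   int numBuckets,
                                                   Map<Integer, String> bucketToFileId) {
        Map<String, String> tagged = new LinkedHashMap<>();
        for (String key : recordKeys) {
            tagged.put(key, bucketToFileId.get(bucketId(key, indexKeyFields, numBuckets)));
        }
        return tagged;
    }

    public static void main(String[] args) {
        Map<Integer, String> view = new HashMap<>();
        view.put(0, "file-0"); // stand-in for the cached partial filesystem view
        List<String> keys = Arrays.asList("id:a,ts:1", "id:b,ts:2");
        System.out.println(tagPartition(keys, Arrays.asList("id"), 4, view));
    }
}
```

In a Spark job, a loop like `tagPartition` would run inside a per-partition map function, which is why tagging needs no join with existing keys.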






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-09 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-964215135


   Hi @nsivabalan, thanks for the review. Based on the comments, one open 
question is where the index key should live: should it be deserialized each 
time it is used, or deserialized once and saved?






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-09 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-964132779


   > Thanks for the contribution. Have left some comments. And IIUC, not all 
clustering strategies may sit well with this bucket index. Some clustering is 
intended to create new file groups, which may not work out in the case of the 
bucket index, right? We should call out somewhere what the constraints are 
with this type of index, like ever-growing file sizes, no small file handling, 
etc.
   
   Yes, you are right. Should I add a doc to the index config to explain the 
constraints?






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-09 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-964128144


   > @minihippo : While I am starting to review, curious to know if you get a 
chance to do any perf analysis with existing bloom index.
   
   In our case, an unpartitioned table with 4 file groups, over 30 TB, and 
more than 500 billion records cannot be processed successfully with the bloom 
index. Note that this was on Hudi 0.6.0.






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-10-27 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-953044953


   > @minihippo yes. we can target this for 0.10.0 . Your take is that this 
should be ready for landing, after which we can do the follow ons?
   
   What should I do before landing? For example, make sure e2e runs, or 
anything else?
   Before landing, does it still need another round of code review?






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-10-26 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-951743238


   Hi @vinothchandar @leesf, sorry for the long delay, which was due to 
personal reasons. I will be focusing more on the Hudi community from now on. 
Can we work together to speed up merging this patch into master in time for 
the 0.10.0 release?






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-09-12 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-917882872


   Hi all, I will address all comments this week.






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-09-01 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-909949793


   > @minihippo Will do. Apologize if I am slow in the next few days. trying to 
push out the 0.9.0 RC
   
   Hi @vinothchandar , do you have any time to review this pr?










[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-08-25 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-905536202


   > @minihippo would you please address the comments above?
   
   done






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-08-18 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-900406209


   Hi @vinothchandar, are there any comments?










[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-08-10 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-896477727


   Hi @vinothchandar, can you take a look?






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-07-28 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-11913


   Hi @leesf, I will fix the unresolved comments in the next two days.






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-07-12 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-878023036


   Hi @leesf, I think the patch is too large. Should I split it into 2 PRs 
for easier review?






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-07-02 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-872866013


   @vinothchandar done






[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-07-01 Thread GitBox


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-872104738


   The `ITTestHoodieDemo` failure is due to its own instability rather than 
being affected by this new feature, and HUDI-2113 has already fixed it.

