[GitHub] [hudi] nsivabalan commented on issue #2338: [SUPPORT] MOR table found duplicate and process so slowly

2021-04-02 Thread GitBox


nsivabalan commented on issue #2338:
URL: https://github.com/apache/hudi/issues/2338#issuecomment-812713376


   Closing due to inactivity, but feel free to reopen or create a new ticket. 
We would be happy to assist you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2338: [SUPPORT] MOR table found duplicate and process so slowly

2021-02-06 Thread GitBox


nsivabalan commented on issue #2338:
URL: https://github.com/apache/hudi/issues/2338#issuecomment-774512527


   @so-lazy : Can you please respond to Balaji's comment when you get a chance? 
   A few more questions as we triage the issue: 
   When you loaded the data into Hudi for the first time, did you use the 
bulk-insert or the insert operation? Were the configs you used for that first 
load the same as the ones you have given in the description of this issue? If 
not, would you mind posting those configs as well?








[GitHub] [hudi] nsivabalan commented on issue #2338: [SUPPORT] MOR table found duplicate and process so slowly

2021-01-03 Thread GitBox


nsivabalan commented on issue #2338:
URL: https://github.com/apache/hudi/issues/2338#issuecomment-753746016


   @so-lazy : I am looping in @bvaradar to help you out here. In the meantime, 
here is some context around Global_Bloom. Hudi has two kinds of indexes, regular 
and global. With a regular index (BLOOM, for example), record keys are unique 
within a partition, but the same record key can exist in different partitions. 
Within a partition, Hudi takes care of updating records based on their record 
keys and serves only the latest snapshot of each record key. 
   With the global versions, record keys are unique across the entire dataset. 
In other words, the same record key cannot exist in different partitions. So, 
if you insert a record, rec_1, into partition1 and later try to insert the same 
record (rec_1) into a different partition, say partition2, Hudi will by default 
update the record in partition1. There is a config you can set, in which case 
Hudi will delete rec_1 from partition1 and insert it into partition2. 
   This is the major difference between the regular and global versions of an 
index. Since a global index has to look up all partitions for all records, it 
is known to be less performant than a regular index. So, unless you have this 
requirement, I would suggest using a regular index (BLOOM, for example). 
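
   To make the difference concrete, here is a minimal toy model in plain 
Python. It is only a sketch of the semantics described above, not Hudi code: 
the `Table` type, `find_partition`, `upsert`, and the `global_index` and 
`update_partition_path` flags are all hypothetical names chosen for 
illustration. In Hudi itself, the behaviour is selected through write configs; 
to my knowledge these are `hoodie.index.type` and, for the global bloom index, 
`hoodie.bloom.index.update.partition.path`, but please verify the names and 
defaults against the docs for your Hudi version.

```python
from typing import Dict, Optional

# Toy table: partition path -> {record key -> payload}.
# This is an illustration only, not Hudi's storage layout.
Table = Dict[str, Dict[str, str]]


def find_partition(table: Table, key: str) -> Optional[str]:
    """Scan every partition for the key, as a global index lookup has to."""
    for partition, records in table.items():
        if key in records:
            return partition
    return None


def upsert(
    table: Table,
    key: str,
    partition: str,
    payload: str,
    global_index: bool,
    update_partition_path: bool = False,  # hypothetical flag for this sketch
) -> None:
    """Write one record under regular or global index semantics."""
    if global_index:
        existing = find_partition(table, key)
        if existing is not None and existing != partition:
            if update_partition_path:
                # Delete from the old partition; the write below re-inserts
                # the record into the new partition.
                del table[existing][key]
            else:
                # Default: keep the record where it already lives.
                partition = existing
    # Regular index: only the incoming partition is consulted, so the same
    # key may end up in more than one partition.
    table.setdefault(partition, {})[key] = payload


regular: Table = {}
upsert(regular, "rec_1", "partition1", "v1", global_index=False)
upsert(regular, "rec_1", "partition2", "v2", global_index=False)
print(regular)  # rec_1 is present in both partitions
```

   Note that `find_partition` has to walk every partition, which is the reason 
the global variants are slower than the regular ones.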
   







[GitHub] [hudi] nsivabalan commented on issue #2338: [SUPPORT] MOR table found duplicate and process so slowly

2020-12-23 Thread GitBox


nsivabalan commented on issue #2338:
URL: https://github.com/apache/hudi/issues/2338#issuecomment-750427679


   If you wish to understand the different indexing schemes, please refer to 
this [blog](https://hudi.apache.org/blog/hudi-indexing-mechanisms/).







[GitHub] [hudi] nsivabalan commented on issue #2338: [SUPPORT] MOR table found duplicate and process so slowly

2020-12-23 Thread GitBox


nsivabalan commented on issue #2338:
URL: https://github.com/apache/hudi/issues/2338#issuecomment-750427001


   @so-lazy : Would you mind elaborating on your use case? Did you choose 
Global_bloom intentionally?  
   And by your statement "i found much duplicate records,..", did you mean 
that compaction hasn't happened and hence you found duplicates, or that your 
dataset in general has duplicates? 
   Do you want to dedup for your use case in general?  
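
   If it helps to answer the questions above, one way to characterise the 
duplicates is to check whether a record key repeats inside a single partition 
or only across partitions. The sketch below is a plain Python illustration, 
not a Hudi API; `classify_duplicates` is a hypothetical helper, and it assumes 
you have exported (record key, partition path) pairs from the table, for 
example from the `_hoodie_record_key` and `_hoodie_partition_path` columns.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def classify_duplicates(
    rows: Iterable[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Split duplicated record keys into two groups.

    rows: (record_key, partition_path) pairs.
    Returns (within, across):
      within - keys that appear more than once inside a single partition
      across - keys that appear in more than one partition
    A key can show up in both lists.
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for key, partition in rows:
        counts[key][partition] += 1

    within = sorted(
        key for key, parts in counts.items() if any(n > 1 for n in parts.values())
    )
    across = sorted(key for key, parts in counts.items() if len(parts) > 1)
    return within, across


sample = [
    ("a", "p1"), ("a", "p1"),  # repeated inside p1
    ("b", "p1"), ("b", "p2"),  # same key in two partitions
    ("c", "p1"),               # unique
]
within, across = classify_duplicates(sample)
print("within one partition:", within)
print("across partitions:", across)
```

   Keys that repeat only across partitions are possible with a regular 
(non-global) index, whereas keys that repeat within one partition point to 
something else and need a closer look.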
   


