Re: Global Index for partitioned Hudi datasets

2019-03-29 Thread Vinoth Chandar
Great thoughts. Let's chat more on the HIP. >> I am thinking something like a min/max on the row key for each file. There could be cases where a monotonically increasing id generation service is used when there are new entities BloomIndex already does this today. In addition to Bloom filters, it
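
For readers following the thread, the min/max idea quoted above can be illustrated with a minimal sketch. This is not Hudi's BloomIndex code: `FileKeyRange`, `candidate_files`, and the sample file ids and keys are hypothetical names made up for illustration, and the sketch assumes row keys are strings that compare lexicographically (for example, zero-padded ids).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileKeyRange:
    """Hypothetical per-file metadata: smallest and largest row key in a file."""
    file_id: str
    min_key: str
    max_key: str


def candidate_files(ranges: Iterable[FileKeyRange], row_key: str) -> List[str]:
    """Return ids of files whose [min_key, max_key] range could contain row_key.

    Files whose range excludes the key are skipped without being opened.
    A file that is returned may still not contain the key, so a further
    check (such as a Bloom filter or a read) is still needed.
    """
    return [r.file_id for r in ranges if r.min_key <= row_key <= r.max_key]


# Illustrative data only: zero-padded, monotonically increasing ids produce
# non-overlapping ranges, so a lookup is narrowed to at most one file here.
ranges = [
    FileKeyRange("file-1", "id-0001", "id-0100"),
    FileKeyRange("file-2", "id-0101", "id-0200"),
    FileKeyRange("file-3", "id-0201", "id-0300"),
]

print(candidate_files(ranges, "id-0150"))
```

When keys arrive in increasing order, the per-file ranges barely overlap and the range check alone removes most files from consideration. If the ranges overlap heavily, the same check returns most files and prunes very little.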

Re: Global Index for partitioned Hudi datasets

2019-03-28 Thread Prasanna
Btw I am not able to comment on the JIRA. I will get this fixed and post the comment on the JIRA as well. Cheers. On Thu, Mar 28, 2019 at 12:49 AM Prasanna wrote: > Hey Nishith, > > Glad we have a concrete proposal on this. > > My 0.02 thoughts on this. > > What we are really building is an

Re: Global Index for partitioned Hudi datasets

2019-03-28 Thread Prasanna
Hey Nishith, Glad we have a concrete proposal on this. My 0.02 thoughts on this. What we are really building is an approximate indexing system which can help us reduce the number of files to look for when a key is updated. The problem with having something random in the key (like uuid) means

Re: Global Index for partitioned Hudi datasets

2019-03-28 Thread nishith agarwal
Here is the HIP: https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing @Vinoth Chandar @balaji added you guys as approvers, please take a look. -Nishith On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal wrote: > JIRA :

Re: Global Index for partitioned Hudi datasets

2019-03-26 Thread nishith agarwal
JIRA: https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53 -Nishith On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal wrote: > All, > > Currently, Hudi supports partitioned and non-partitioned datasets. A > partitioned dataset is one which bucketizes groups of files (data) into >

Global Index for partitioned Hudi datasets

2019-03-26 Thread nishith agarwal
All, Currently, Hudi supports partitioned and non-partitioned datasets. A partitioned dataset is one which bucketizes groups of files (data) into buckets called partitions. A Hudi dataset may be composed of N partitions with M files. This structure helps canonical