subject:"Global Index for partitioned Hudi datasets"

Re: Global Index for partitioned Hudi datasets

2019-03-29 Thread Vinoth Chandar

Great thoughts. Let's chat more on the HIP.

>> I am thinking something like a min/max on the row key for each file. There could be cases where a monotonically increasing id generation service is used when there are new entities

BloomIndex already does this today. In addition to Bloom filters, it
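
A minimal sketch of the per-file min/max idea discussed in this message, for illustration only. It assumes each file records the smallest and largest row key it contains, so that a lookup can skip any file whose range cannot hold the key. The names FileKeyRange and candidate_files are hypothetical and are not Hudi's API; a Bloom filter check would normally follow on the files that survive this pruning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileKeyRange:
    """Hypothetical per-file metadata: smallest and largest row key in the file."""
    file_id: str
    min_key: str
    max_key: str


def candidate_files(ranges, record_key):
    """Return ids of files whose [min_key, max_key] range could contain record_key."""
    return [r.file_id for r in ranges if r.min_key <= record_key <= r.max_key]


# Keys from a monotonically increasing id service give non-overlapping ranges,
# so most files are pruned before any Bloom filter is consulted.
ranges = [
    FileKeyRange("file-1", "id-0001", "id-1000"),
    FileKeyRange("file-2", "id-1001", "id-2000"),
    FileKeyRange("file-3", "id-2001", "id-3000"),
]

print(candidate_files(ranges, "id-1500"))  # ['file-2']
```

Range pruning of this kind only pays off when key ranges across files overlap little; it is a sketch of the general technique, not a description of how BloomIndex is implemented.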

Re: Global Index for partitioned Hudi datasets

2019-03-28 Thread Prasanna

Btw I am not able to comment on the jira. I will get this fixed and post the comment on the jira as well. Cheers.

On Thu, Mar 28, 2019 at 12:49 AM Prasanna wrote:
> Hey Nishith,
>
> Glad we have a concrete proposal on this.
>
> My 0.02 thoughts on this.
>
> What we are really building is an

Re: Global Index for partitioned Hudi datasets

2019-03-28 Thread Prasanna

Hey Nishith, Glad we have a concrete proposal on this. My 0.02 thoughts on this. What we are really building is an approximate indexing system which can help us reduce the number of files to look for when a key is updated. The problem with having something random in the key (like uuid) means

Re: Global Index for partitioned Hudi datasets

2019-03-28 Thread nishith agarwal

Here is the HIP : https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing

@Vinoth Chandar @balaji added you guys as approvers, please take a look.

-Nishith

On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal wrote:
> JIRA :

Re: Global Index for partitioned Hudi datasets

2019-03-26 Thread nishith agarwal

JIRA : https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53

-Nishith

On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal wrote:
> All,
>
> Currently, Hudi supports partitioned and non-partitioned datasets. A partitioned dataset is one which bucketizes groups of files (data) into

Global Index for partitioned Hudi datasets

2019-03-26 Thread nishith agarwal

All, Currently, Hudi supports partitioned and non-partitioned datasets. A partitioned dataset is one which bucketizes groups of files (data) into buckets called partitions. A Hudi dataset may be composed of N partitions with M files. This structure helps canonical

6 matches