Got you. Thanks for the clarification.

On Fri, Nov 5, 2021 at 3:53 PM Vinoth Chandar <[email protected]> wrote:
> Hi Siva,
>
> I think this is more about bloom filters and record level index, which is
> different from RFC-27.
>
> RFC-08 talks about record level indexing. A discuss thread on bloom filter
> indexes has just been kicked off.
>
> The main thing we are trying to solidify in 0.10.0 is the foundational
> metadata table and concurrency mechanisms, so that we can, say, add an
> index in the background.
>
> Thanks
> Vinoth
>
> On Fri, Nov 5, 2021 at 8:47 AM Sivabalan <[email protected]> wrote:
>
> > Thanks for bringing this up. We have RFC-27 on data skipping
> > <https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance>,
> > which is the secondary indexing being discussed here. We are fleshing out
> > a few more details on this end and will put up patches once we figure out
> > the unknowns. We have a WIP patch here
> > <https://github.com/apache/hudi/pull/3475>, but it needs some refactoring
> > and updates before it's ready for review.
> > We are also thinking of moving the existing bloom filters (from data
> > files) into the metadata table and reusing them instead of reading from
> > all data files, with the expectation of boosting performance for index
> > lookup. We will start a discussion thread around this and go from there.
> >
> >
> >
> > On Wed, Nov 3, 2021 at 5:36 PM Nicolas Paris <[email protected]>
> > wrote:
> >
> > >
> > > > In other words, we are generalizing this so hudi feels more like
> > > > MySQL and not HBase/Cassandra (key value store). That's the direction
> > > > we are approaching.
> > >
> > > Wow, this is amazing. I haven't yet found an RFC about this, nor a PR
> > > ready to test.
> > >
> > > This answers my initial question: with the secondary index options
> > > coming, the hudi key should be a primary key (if one exists). There is
> > > no reason to choose anything else.
> > >
> > > On Wed Nov 3, 2021 at 9:03 PM CET, Vinoth Chandar wrote:
> > > > Hi.
> > > >
> > > > With the indexing approach we are taking, you should be able to add
> > > > secondary indexes on any column, not just the key.
> > > > In other words, we are generalizing this so hudi feels more like MySQL
> > > > and not HBase/Cassandra (key value store). That's the direction we are
> > > > approaching.
> > > >
> > > > Love to hear more feedback.
> > > >
> > > > On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris <[email protected]>
> > > > wrote:
> > > >
> > > > > For example, does the move of blooms into hfiles (0.10.0 feature)
> > > > > make unique bloom keys mandatory?
> > > > >
> > > > >
> > > > >
> > > > > On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote:
> > > > > >
> > > > > > > Are you asking if there are advantages to allowing duplicates or
> > > > > > > not having keys in your table?
> > > > > > It's all about allowing duplicates.
> > > > > >
> > > > > > The use case is, say, an Order table with key = customer_id, then
> > > > > > being able to do an indexed delete without needing to prescan the
> > > > > > dataset.
> > > > > >
> > > > > > I wonder if there will be trouble I am unaware of with such a trick.
> > > > > >
> > > > > > On Thu Oct 28, 2021 at 2:33 PM CEST, Vinoth Chandar wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Are you asking if there are advantages to allowing duplicates or
> > > > > > > not having keys in your table?
> > > > > > >
> > > > > > > Having keys helps with other practical scenarios, in addition to
> > > > > > > what you called out.
> > > > > > > E.g., oftentimes you would want to backfill an insert-only table
> > > > > > > and you don't want to introduce duplicates when doing so.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
> > > > > > >
> > > > > > > On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi devs,
> > > > > > > >
> > > > > > > > AFAIK, hudi has been designed to have primary keys as the hudi
> > > > > > > > key. However, it is also possible to choose a non-unique field.
> > > > > > > > I have listed several troubles with such a design:
> > > > > > > >
> > > > > > > > A non-unique key leads to:
> > > > > > > > - cannot delete / update a unique record
> > > > > > > > - cannot apply primary key for new sql tables feature
> > > > > > > >
> > > > > > > > Are there other downsides to choosing a non-unique key that you
> > > > > > > > have in mind?
> > > > > > > >
> > > > > > > > In my case, having user_id as a hudi key will help to apply
> > > > > > > > deletion on the user level in any user table. The tables are
> > > > > > > > insert-only, so the drawbacks listed above do not really apply.
> > > > > > > > In case of errors in the tables, I have several options:
> > > > > > > >
> > > > > > > > - rollback to a previous commit
> > > > > > > > - read partition/filter overwrite partition
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > -- 
> > Regards,
> > -Sivabalan
> >
>

-- 
Regards,
-Sivabalan
