Thanks for bringing this up. We have RFC-27 on data skipping
<https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance>
which is the secondary indexing being discussed here. We are fleshing out
a few more details on this end and will put up patches once we figure out
the unknowns. We have a WIP patch here
<https://github.com/apache/hudi/pull/3475>, but it needs some refactoring and
updates before it's ready for review.
We are also thinking of moving the existing bloom filters (from data
files) into the metadata table and reusing them instead of reading them from
all data files, with the expectation that this will speed up index lookups. We will
start a discussion thread around this and go from there.



On Wed, Nov 3, 2021 at 5:36 PM Nicolas Paris <nicolas.pa...@riseup.net>
wrote:

>
> > In other words, we are generalizing this so hudi feels more like
> > MySQL and not HBase/Cassandra (key-value stores). That's the direction
> > we are heading in.
>
> Wow, this is amazing. I haven't yet found an RFC about this, nor a PR
> ready to test.
>
> This answers my initial question: with the secondary index options
> coming, the hudi key should be a primary key (if one exists). There is no
> reason to choose anything else.
>
> On Wed Nov 3, 2021 at 9:03 PM CET, Vinoth Chandar wrote:
> > Hi.
> >
> > With the indexing approach we are taking, you should be able to add
> > secondary indexes on any column, not just the key.
> > In other words, we are generalizing this so hudi feels more like MySQL
> > and not HBase/Cassandra (key-value stores). That's the direction we are
> > heading in.
> >
> > Would love to hear more feedback.
> >
> > On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris <nicolas.pa...@riseup.net>
> > wrote:
> >
> > > For example, does the move of blooms into hfiles (0.10.0 feature) make
> > > unique bloom keys mandatory?
> > >
> > >
> > >
> > > On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote:
> > > >
> > > > > Are you asking if there are advantages to allowing duplicates or not
> > > > > having keys in your table?
> > > > it's all about allowing duplicates
> > > >
> > > > The use case is, say, an Order table with key = customer_id,
> > > > which makes an indexed delete possible without needing to prescan the
> > > > dataset.
> > > >
> > > > I wonder if there will be trouble I am unaware of with such a trick.
> > > >
> > > > On Thu Oct 28, 2021 at 2:33 PM CEST, Vinoth Chandar wrote:
> > > > > Hi,
> > > > >
> > > > > Are you asking if there are advantages to allowing duplicates or not
> > > > > having keys in your table?
> > > > >
> > > > > Having keys helps with other practical scenarios, in addition to what
> > > > > you called out.
> > > > > e.g.: Oftentimes, you would want to backfill an insert-only table and
> > > > > you don't want to introduce duplicates when doing so.
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris <nicolas.pa...@riseup.net>
> > > > > wrote:
> > > > >
> > > > > > Hi devs,
> > > > > >
> > > > > > AFAIK, hudi has been designed to use a primary key as the hudi key.
> > > > > > However, it is also possible to choose a non-unique field. I have
> > > > > > listed several problems with such a design:
> > > > > >
> > > > > > A non-unique key means you:
> > > > > > - cannot delete / update a unique record
> > > > > > - cannot apply a primary key for the new SQL tables feature
> > > > > >
> > > > > > Are there other downsides to choosing a non-unique key that you have
> > > > > > in mind?
> > > > > >
> > > > > > In my case, having user_id as the hudi key will help to apply deletions
> > > > > > at the user level in any user table. The tables are insert-only, so the
> > > > > > drawbacks listed above do not really apply. In case of errors in the
> > > > > > tables, I have several options:
> > > > > >
> > > > > > - rollback to a previous commit
> > > > > > - read partition/filter overwrite partition
> > > > > >
> > > > > > Thanks
> > > > > >
> > >
> > >
>
>

-- 
Regards,
-Sivabalan
