Re: Limitations of non unique keys

Sivabalan Fri, 05 Nov 2021 15:31:46 -0700

got you. thanks for the clarification.

On Fri, Nov 5, 2021 at 3:53 PM Vinoth Chandar <[email protected]>
wrote:


> Hi Siva,
>
> I think this is more about bloom filters and record level index, which is
> different from RFC-27.
>
> RFC-08 talks about record level indexing. A discussion thread on bloom
> filter indexes was just kicked off.
>
> The main thing we are trying to solidify in 0.10.0 is the foundational
> metadata table and the concurrency mechanisms needed to, say, add an index
> in the background.
>
> Thanks
> Vinoth
>
> On Fri, Nov 5, 2021 at 8:47 AM Sivabalan <[email protected]> wrote:
>
> > Thanks for bringing this up. We have RFC-27 on data skipping
> > <https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance>
> > which is the secondary indexing being discussed here. We are fleshing out
> > a few more details on this end and will put up patches once we figure out
> > the unknowns. We have a WIP patch here
> > <https://github.com/apache/hudi/pull/3475>, but it needs some refactoring
> > and updates before it's ready for review.
> > We are also thinking of moving the existing bloom filters (from data
> > files) into the metadata table and reusing them, instead of reading them
> > from all data files, with the expectation that this will speed up index
> > lookups. We will start a discussion thread around this and go from there.
> >
> >
> >
> > On Wed, Nov 3, 2021 at 5:36 PM Nicolas Paris <[email protected]>
> > wrote:
> >
> > >
> > > > In other words, we are generalizing this so hudi feels more like
> > > > MySQL and not HBase/Cassandra (key value store). That's the direction
> > > > we are heading in.
> > >
> > > Wow, this is amazing. I haven't found an RFC about this yet, nor a PR
> > > that is ready to test.
> > >
> > > This answers my initial question: with the secondary index options
> > > coming, the hudi key should be the primary key (if one exists). There is
> > > no reason to choose anything else.
> > >
> > > On Wed Nov 3, 2021 at 9:03 PM CET, Vinoth Chandar wrote:
> > > > Hi.
> > > >
> > > > With the indexing approach we are taking, you should be able to add
> > > > secondary indexes on any column, not just the key.
> > > > In other words, we are generalizing this so hudi feels more like
> > > > MySQL and not HBase/Cassandra (key value store). That's the direction
> > > > we are heading in.
> > > >
> > > > Would love to hear more feedback.
> > > >
> > > > On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris <[email protected]>
> > > > wrote:
> > > >
> > > > > For example, does the move of bloom filters into HFiles (a 0.10.0
> > > > > feature) make unique bloom keys mandatory?
> > > > >
> > > > >
> > > > >
> > > > > On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote:
> > > > > >
> > > > > > > Are you asking if there are advantages to allowing duplicates
> > > > > > > or not having keys in your table?
> > > > > > It's all about allowing duplicates.
> > > > > >
> > > > > > The use case is, say, an Order table with key = customer_id,
> > > > > > which makes an indexed delete possible without prescanning the
> > > > > > dataset.
> > > > > >
> > > > > > I wonder whether such a trick will cause trouble I am unaware of.
> > > > > >
> > > > > > On Thu Oct 28, 2021 at 2:33 PM CEST, Vinoth Chandar wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Are you asking if there are advantages to allowing duplicates
> > > > > > > or not having keys in your table?
> > > > > > >
> > > > > > > Having keys helps with other practical scenarios, in addition
> > > > > > > to what you called out.
> > > > > > > E.g., you often want to backfill an insert-only table without
> > > > > > > introducing duplicates when doing so.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
> > > > > > >
> > > > > > > On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi devs,
> > > > > > > >
> > > > > > > > AFAIK, hudi has been designed to use the primary key as the
> > > > > > > > hudi key. However, it is also possible to choose a non-unique
> > > > > > > > field. I have listed several problems with such a design:
> > > > > > > >
> > > > > > > > A non-unique key means you:
> > > > > > > > - cannot delete / update a single record
> > > > > > > > - cannot apply a primary key for the new SQL tables feature
> > > > > > > >
> > > > > > > > Are there other downsides to choosing a non-unique key that
> > > > > > > > you have in mind?
> > > > > > > >
> > > > > > > > In my case, having user_id as the hudi key will help to apply
> > > > > > > > deletion at the user level in any user table. The tables are
> > > > > > > > insert-only, so the drawbacks listed above do not really apply.
> > > > > > > > In case of errors in the tables, I have several options:
> > > > > > > >
> > > > > > > > - roll back to a previous commit
> > > > > > > > - read the partition, filter it, and overwrite the partition
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > >
> > > > >
> > >
> > >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


-- 
Regards,
-Sivabalan
