Re: feature request/proposal: leverage bloom indexes for reading

2021-10-26 Thread Nicolas Paris
Hi Vinoth,

Thanks for the starter. Once the new way to manage indexes is available
and we have migrated to hudi on our datalake, I'd definitely be glad to
give this a shot.


Regards, Nicolas

On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> Hi Nicolas,
>
> Thanks for raising this! I think it's a very valid ask.
> https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
>
> As a proof of concept, would you be able to give filterExists() a shot and
> see if the filtering time improves?
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
>
> In the upcoming 0.10.0 release, we are planning to move the bloom filters
> out to a partition on the metadata table, to speed this up even more for
> very large tables.
> https://issues.apache.org/jira/browse/HUDI-1295
>
> Please let us know if you are interested in testing that when the PR is up.
>
> Thanks
> Vinoth
>
> On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris wrote:
>
> > Hi!
> >
> > In my use case, for GDPR, I have to export all information about a given
> > user from several HUGE hudi tables. Filtering the tables results in a
> > full scan of around 10 hours, and this will get worse year after year.
> >
> > Since the filter criterion is based on the bloom key (user_id), it would
> > be handy to exploit the bloom and produce a temporary table (in the
> > metastore, for example) with the resulting rows.
> >
> > So far, bloom indexing is used only for update/delete operations on a
> > hudi table.
> >
> > 1. There is an opportunity to exploit the bloom for select operations.
> > The hudi options would be:
> > operation: select
> > result-table: 
> > result-path: 
> > result-schema:  (optional; when empty, no
> > sync with the HMS, only the raw path)
> >
> >
> > 2. It could be implemented as predicate pushdown in the Spark
> > datasource API, when filtering with an IN statement.
> >
> >
> > Thoughts?
> >
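
The proposal in this thread, using per-file bloom filters to skip files
during a read, can be sketched in plain Python. This is a toy model and not
hudi's implementation: BloomFilter, build_file_index, select_by_keys, and the
sample data are all hypothetical names introduced for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_file_index(files):
    """Build one Bloom filter per file from the user_id keys it stores."""
    index = {}
    for name, rows in files.items():
        bloom = BloomFilter()
        for row in rows:
            bloom.add(row["user_id"])
        index[name] = bloom
    return index


def select_by_keys(files, index, wanted_keys):
    """Read only the files whose Bloom filter may contain a wanted key."""
    wanted = set(wanted_keys)
    candidates = [
        name for name, bloom in index.items()
        if any(bloom.might_contain(key) for key in wanted)
    ]
    # Bloom filters can give false positives, so rows are still filtered exactly.
    rows = [r for name in candidates for r in files[name] if r["user_id"] in wanted]
    return rows, candidates


# Hypothetical data: three "files" holding rows keyed by user_id.
files = {
    "file_a": [{"user_id": "u1", "event": "login"}, {"user_id": "u2", "event": "view"}],
    "file_b": [{"user_id": "u3", "event": "login"}],
    "file_c": [{"user_id": "u1", "event": "purchase"}],
}
index = build_file_index(files)
rows, scanned = select_by_keys(files, index, ["u1"])
print(rows)     # the rows belonging to u1
print(scanned)  # the files that had to be opened
```

Files whose filter rules out every wanted key are never opened, which is
where the saving over a full scan would come from. The exact row filter is
kept because a bloom filter can only say "maybe present" or "definitely
absent".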



Limitations of non unique keys

2021-10-26 Thread Nicolas Paris
Hi devs,

AFAIK, hudi has been designed to use a primary key as the hudi key.
However, it is also possible to choose a non-unique field. I have listed
several problems with such a design:

A non-unique key means that you:
- cannot delete / update a single record
- cannot use the primary key for the new SQL tables feature

Are there other downsides to choosing a non-unique key that you have in mind?

In my case, having user_id as the hudi key will help to apply deletions at
the user level in any user table. The tables are insert-only, so the
drawbacks listed above do not really apply. In case of errors in the
tables, I have several options:

- rollback to a previous commit
- read the partition, filter it, and overwrite the partition

Thanks
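
The first limitation above can be pictured with a toy model, assuming a
table that addresses rows only by their record key. KeyedTable and its
methods are hypothetical names for illustration and are not part of hudi's
API; actual hudi behavior depends on the operation and index configuration.

```python
from collections import defaultdict


class KeyedTable:
    """Toy table that addresses rows only by record key (not hudi's API)."""

    def __init__(self):
        self.rows_by_key = defaultdict(list)

    def insert(self, key, row):
        # Insert-only path: rows sharing a key accumulate side by side.
        self.rows_by_key[key].append(row)

    def delete(self, key):
        # The key is the only handle, so every row sharing it is removed.
        return len(self.rows_by_key.pop(key, []))

    def count(self):
        return sum(len(rows) for rows in self.rows_by_key.values())


table = KeyedTable()
table.insert("u1", {"event": "login"})
table.insert("u1", {"event": "purchase"})
table.insert("u2", {"event": "login"})

removed = table.delete("u1")  # removes both of u1's rows at once
print(removed, table.count())
```

With user_id as the key, a delete works at the user level, which suits the
GDPR use case, but there is no way to target only one of the user's rows.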