Re: feature request/proposal: leverage bloom indexes for reading

Nicolas Paris Tue, 26 Oct 2021 07:26:56 -0700

Hi Vinoth,

Thanks for the starter. Definitely, once the new way to manage indexes
is in place and we have migrated to hudi on our datalake, I'd be glad to
give this a shot.



Regards, Nicolas

On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> Hi Nicolas,
>
> Thanks for raising this! I think it's a very valid ask.
> https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
>
> As a proof of concept, would you be able to give filterExists() a shot
> and see if the filtering time improves?
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
>
> In the upcoming 0.10.0 release, we are planning to move the bloom
> filters out to a partition on the metadata table, to speed this up even
> further for very large tables.
> https://issues.apache.org/jira/browse/HUDI-1295
>
> Please let us know if you are interested in testing that when the PR is up.
>
> Thanks
> Vinoth
>
> On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris <[email protected]>
> wrote:
>
> > hi !
> >
> > In my use case, for GDPR I have to export all information about a given
> > user from several HUGE hudi tables. Filtering the tables results in a
> > full scan of around 10 hours, and this will get worse year after year.
> >
> > Since the filter criterion is based on the bloom key (user_id), it would
> > be handy to exploit the bloom and produce a temporary table (in the
> > metastore, e.g.) with the resulting rows.
> >
> > So far, bloom indexing is only used for update/delete operations on a
> > hudi table.
> >
> > 1. There is an opportunity to exploit the bloom for select operations.
> > The hudi options would be:
> > operation: select
> > result-table: <table name>
> > result-path: <s3 path|hdfs path>
> > result-schema: <table schema in metastore> (optional; when empty, no
> > sync with the HMS, only the raw path)
> >
> >
> > 2. It could be implemented as predicate pushdown in the Spark
> > datasource API, when filtering with an IN statement.
> >
> >
> > Thoughts?
> >
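
The idea discussed in this thread, using per-file bloom filters to skip
files during a keyed read, can be sketched in isolation. What follows is a
minimal, self-contained Python illustration and not Hudi code: BloomFilter,
DataFile and select_by_keys are hypothetical names invented for this sketch,
and the in-memory "files" merely stand in for data files that carry a bloom
filter on the record key (user_id here).

```python
import hashlib
from dataclasses import dataclass, field


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file with a bloom filter on its keys."""
    name: str
    rows: list  # list of (user_id, payload) tuples
    bloom: BloomFilter = field(default_factory=BloomFilter)

    def __post_init__(self):
        # Populate the filter with every record key stored in this file.
        for user_id, _ in self.rows:
            self.bloom.add(user_id)


def select_by_keys(files, wanted_ids):
    """Return matching rows and the names of the files actually scanned."""
    wanted = set(wanted_ids)
    scanned, result = [], []
    for f in files:
        # Skip the file when its filter rules out every requested key.
        if not any(f.bloom.might_contain(k) for k in wanted):
            continue
        scanned.append(f.name)
        # The filter can give false positives, so still check each row.
        result.extend(row for row in f.rows if row[0] in wanted)
    return result, scanned


files = [
    DataFile("file-a", [("u1", "order-1"), ("u2", "order-2")]),
    DataFile("file-b", [("u3", "order-3"), ("u4", "order-4")]),
    DataFile("file-c", [("u1", "order-5"), ("u5", "order-6")]),
]
rows, scanned = select_by_keys(files, ["u1"])
print(rows)
print(scanned)
```

Because a bloom filter can return false positives, a file may occasionally
be scanned needlessly, but a file that contains the key is never skipped,
which is why the row-level check is kept after the pruning step.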
