Sounds great!

On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris <[email protected]> wrote:
> Hi Vinoth,
>
> Thanks for the starter. Definitely, once the new way to manage indexes
> is available and we have migrated to hudi on our datalake, I'd be glad
> to give this a shot.
>
> Regards, Nicolas
>
> On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> > Hi Nicolas,
> >
> > Thanks for raising this! I think it's a very valid ask.
> > https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
> >
> > As a proof of concept, would you be able to give filterExists() a shot
> > and see if the filtering time improves?
> > https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
> >
> > In the upcoming 0.10.0 release, we are planning to move the bloom
> > filters out to a partition on the metadata table, to speed this up
> > even further for very large tables.
> > https://issues.apache.org/jira/browse/HUDI-1295
> >
> > Please let us know if you are interested in testing that when the PR
> > is up.
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris <[email protected]>
> > wrote:
> >
> > > Hi!
> > >
> > > In my use case, for GDPR, I have to export all the information about
> > > a given user from several HUGE hudi tables. Filtering the table
> > > results in a full scan of around 10 hours, and this will get worse
> > > year after year.
> > >
> > > Since the filter criterion is based on the bloom key (user_id), it
> > > would be handy to exploit the bloom and produce a temporary table
> > > (in the metastore, e.g.) with the resulting rows.
> > >
> > > So far the bloom indexing is used for update/delete operations on a
> > > hudi table.
> > >
> > > 1. There is an opportunity to exploit the bloom for select
> > > operations. The hudi options would be:
> > >   operation: select
> > >   result-table: <table name>
> > >   result-path: <s3 path|hdfs path>
> > >   result-schema: <table schema in metastore> (optional; when empty,
> > >   no sync with the hms, only raw path)
> > >
> > > 2. It could be implemented as predicate pushdown in the spark
> > > datasource API, when filtering with an IN statement.
> > >
> > > Thoughts?
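
For anyone following the thread, here is a minimal, self-contained sketch of the idea being discussed: keep one bloom filter per data file, and for a lookup on the bloom key only scan the files whose filter may contain it. This is not Hudi code and does not use any Hudi API. The BloomFilter class, its sizes, the file names and the helper functions are all hypothetical, chosen only to illustrate the pruning step.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative assumptions, not Hudi's defaults."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def build_file_filters(files):
    """files: dict of hypothetical file name -> list of (user_id, payload) rows."""
    filters = {}
    for name, rows in files.items():
        bf = BloomFilter()
        for user_id, _ in rows:
            bf.add(user_id)
        filters[name] = bf
    return filters


def select_by_keys(files, filters, wanted_ids):
    """Scan only the files whose Bloom filter may contain one of wanted_ids."""
    wanted = set(wanted_ids)
    candidates = [
        name for name, bf in filters.items()
        if any(bf.might_contain(key) for key in wanted)
    ]
    # Bloom filters can give false positives, so rows are still checked exactly.
    rows = [row for name in candidates for row in files[name] if row[0] in wanted]
    return candidates, rows


# Hypothetical data: three "files" keyed by user_id.
files = {
    "file_a.parquet": [("u1", "a"), ("u2", "b")],
    "file_b.parquet": [("u3", "c"), ("u4", "d")],
    "file_c.parquet": [("u1", "e"), ("u5", "f")],
}
filters = build_file_filters(files)
candidates, rows = select_by_keys(files, filters, ["u1"])
```

Files that hold the key are always among the candidates, since a bloom filter has no false negatives. A file without the key may occasionally be scanned too, which is why the rows are re-checked against the exact key set.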
