Hi Nicolas,

Thanks for raising this! I think it's a very valid ask.
https://issues.apache.org/jira/browse/HUDI-2601 has been filed to track it.

As a proof of concept, would you be able to give filterExists() a shot  and
see if the filtering time improves?
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
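Roughly, a PoC driver could look like the sketch below (untested, written
against the 0.9.x client APIs, so exact constructors/signatures may differ
in your version; the base path, user ids and the empty partition path are
placeholders, and with a non-global bloom index you would need the real
partition paths). It builds HoodieRecords for the user_id keys and times a
filterExists() call; note that filterExists() returns the records that do
NOT already exist in the table, so the keys of interest are the ones it
filters out.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hudi.client.HoodieReadClient;
import org.apache.hudi.client.common.HoodieSparkEngineContext;
import org.apache.hudi.common.model.HoodieAvroPayload;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.util.Option;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class FilterExistsPoc {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().appName("filter-exists-poc").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    String basePath = "s3://bucket/path/to/hudi_table";        // placeholder
    List<String> userIds = Arrays.asList("user-1", "user-2");  // keys to probe

    // HoodieKey needs a partition path; with GLOBAL_BLOOM it is not used for
    // the lookup, otherwise the real partition path has to be supplied.
    List<HoodieRecord<HoodieAvroPayload>> probeList = userIds.stream()
        .map(id -> new HoodieRecord<>(new HoodieKey(id, ""),
            new HoodieAvroPayload(Option.empty())))
        .collect(Collectors.toList());
    JavaRDD<HoodieRecord<HoodieAvroPayload>> probes = jsc.parallelize(probeList);

    HoodieReadClient<HoodieAvroPayload> readClient =
        new HoodieReadClient<>(new HoodieSparkEngineContext(jsc), basePath);

    long start = System.currentTimeMillis();
    // filterExists() keeps only the records whose keys are NOT in the table.
    long missing = readClient.filterExists(probes).count();
    System.out.println("missing keys: " + missing + ", lookup took "
        + (System.currentTimeMillis() - start) + " ms");
  }
}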

In the upcoming 0.10.0 release, we are planning to move the bloom filters
out to a partition on the metadata table, to speed this up even further for
very large tables.
https://issues.apache.org/jira/browse/HUDI-1295
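(In the meantime, if you are not already doing so, enabling the metadata
table on the query side may already help with the file listing part of the
cost; the bloom filter partition from HUDI-1295 would build on the same
metadata table. For example, as a read option:

Dataset<Row> df = spark.read().format("hudi")
    .option("hoodie.metadata.enable", "true")  // metadata-table-based listing
    .load(basePath);
)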

Please let us know if you are interested in testing that when the PR is up.

Thanks
Vinoth

On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris <nicolas.pa...@riseup.net>
wrote:

> hi !
>
> In my use case, for GDPR I have to export all the information for a given
> user from several HUGE Hudi tables. Filtering a table results in a full
> scan of around 10 hours, and this will get worse year after year.
>
> Since the filter criterion is based on the bloom key (user_id), it would
> be handy to exploit the bloom filters and produce a temporary table (in
> the metastore, for example) with the resulting rows.
>
> So far, the bloom index is only used for update/delete operations on a
> Hudi table.
>
> 1. There is an opportunity to exploit the bloom index for select operations.
> The Hudi options would be:
> operation: select
> result-table: <table name>
> result-path: <s3 path|hdfs path>
> result-schema: <table schema in metastore> (optional; when empty, no
> sync with the HMS, only the raw path)
>
>
> 2. It could be implemented as a predicate pushdown in the Spark datasource
> API, when filtering with an IN statement.
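> For example, a rough sketch of the user-facing read (the path and ids are
> placeholders):
>
>     // assuming: import static org.apache.spark.sql.functions.col;
>     Dataset<Row> matches = spark.read().format("hudi").load(basePath)
>         .where(col("user_id").isin("user-1", "user-2"));
>     matches.write().format("parquet").save("/tmp/gdpr_export");
>
> where the IN values would be pushed down to the bloom filters to prune
> files instead of scanning everything.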
>
>
> Thoughts?
>
