Re: Bloom filters for full-text search and predicate pushdown

Gang Wu Wed, 07 Jun 2023 03:02:23 -0700

Hi Marco,

That sounds interesting!


However, this requires the Parquet implementation to be able to tokenize
both the strings being written and the literals in the filters. The actual
efficiency depends on the data distribution. I am also concerned about the
possible explosion of distinct values introduced by splitting words, which
may result in a large bloom filter.
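
To make the two concerns concrete, here is a rough sketch in Python. It is
only an illustration under stated assumptions, not Parquet code: the
tokenizer (lowercase, split on non-alphanumeric characters), the SHA-256
based hashing, and the names tokenize, BloomFilter, build_token_filter and
chunk_might_match are all hypothetical, and the bit layout is not Parquet's
split-block Bloom filter format. It shows that the same tokenizer must run
on the written values and on the filter literal, and that the filter is
sized by the number of distinct tokens rather than distinct values.

```python
import hashlib
import math
import re


def tokenize(text):
    """Assumed, simplistic tokenizer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


class BloomFilter:
    """Plain Bloom filter (hypothetical; not Parquet's split-block layout)."""

    def __init__(self, expected_items, fpp=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2
        n = max(1, expected_items)
        self.num_bits = max(8, math.ceil(-n * math.log(fpp) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive k bit positions from two 64-bit hash halves.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


def build_token_filter(values, fpp=0.01):
    """Write path: tokenize every value and insert each distinct token."""
    tokens = {tok for v in values for tok in tokenize(v)}
    bf = BloomFilter(len(tokens), fpp)
    for tok in tokens:
        bf.add(tok)
    return bf, len(tokens)


def chunk_might_match(bf, keyword_literal):
    """Read path: the literal goes through the same tokenizer as the data."""
    return all(bf.might_contain(tok) for tok in tokenize(keyword_literal))


titles = ["The title of my movie", "Another movie title", "The best movie title"]
bf, distinct_tokens = build_token_filter(titles)
print(len(titles), "values ->", distinct_tokens, "distinct tokens,", bf.num_bits, "bits")
print(chunk_might_match(bf, "Best"))  # a present word is never rejected
```

Since the bit count grows linearly with the number of distinct tokens, a
column of long free-text values could need a much larger filter than the
same column indexed by whole values.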

Have you tried a PoC to get a rough estimate of the benefits in your use case?

Best,
Gang



On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <[email protected]> wrote:

> Hello,
>
> I see that Parquet already supports Bloom filters.
>
> As far as I understand, it currently uses them only on the entire value.
>
> For example, if I have a column "MovieTitle":
>
> - "The title of my movie"
> - "Another movie title"
> - "The best movie title"
> - ...
>
> Then the current Bloom filters can be used to find only the column
> chunks/pages that match an exact title. For example you can use the bloom
> filter to search for "The best movie title".
>
> It would be interesting to have *a bloom filter on the specific words*,
> instead of using the entire value: in this way you can search the word
> "best" in the "MovieTitle" column and find the titles that contain that
> specific word in an efficient way.
>
> It would enable a sort of full-text search of keywords inside text columns.
> It would also allow predicate pushdown for searches based on keywords.
>
> Would it make sense to have such an addition? Is there any strategy already
> used by Parquet for fast keyword searches inside text columns?
>
>
> Best regards,
> Marco Colli
> AbstractBrain srls
>
