Hi Marco,
Could you describe how your proposal differs from tokenizing the target
string and storing the list of tokens in a column that has a bloom filter
attached?  I think this should be supportable today by the format at least,
if not by existing libraries.
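
To make the question concrete, here is a minimal sketch of what I have in
mind, in plain Python. It does not use any Parquet library: the BloomFilter
class, the tokenize() rule (lowercase alphanumeric runs), the filter size,
and the two "chunks" are all assumptions for illustration, standing in for a
token-list column with one bloom filter per column chunk.

```python
import hashlib
import re


class BloomFilter:
    """Toy Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bitset

    def _positions(self, item):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_chunk_filter(titles):
    # One filter per chunk, populated with every distinct token in the chunk.
    bloom = BloomFilter()
    for title in titles:
        for token in set(tokenize(title)):
            bloom.add(token)
    return bloom


# Two made-up "column chunks" of the MovieTitle column.
chunks = [
    ["The title of my movie", "Another movie title"],
    ["The best movie title"],
]
filters = [build_chunk_filter(chunk) for chunk in chunks]


def candidate_chunks(keyword):
    # The search literal must go through the same tokenizer as the data.
    tokens = tokenize(keyword)
    return [
        index
        for index, bloom in enumerate(filters)
        if all(bloom.might_contain(token) for token in tokens)
    ]


print(candidate_chunks("best"))
```

Chunk 1 is always among the candidates for "best"; chunk 0 would show up
only on a false positive, and candidate chunks still have to be scanned to
confirm the match.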

Thanks,
Micah

On Wednesday, June 7, 2023, Gang Wu <[email protected]> wrote:

> Hi Marco,
>
> That sounds interesting!
>
> However, this requires the Parquet implementation to be able to tokenize
> both the strings being written and the literals in the filters. The actual
> efficiency depends on the data distribution. I am also concerned about the
> possible explosion of distinct values introduced by splitting words, which
> may result in a large bloom filter.
>
> Have you tried a PoC to get a rough estimate of the benefits in your use
> case?
>
> Best,
> Gang
>
>
>
> On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <[email protected]> wrote:
>
> > Hello,
> >
> > I see that Parquet already supports Bloom filters.
> >
> > To my understanding, it currently uses them only on the entire value.
> >
> > For example, if I have a column "MovieTitle":
> >
> > - "The title of my movie"
> > - "Another movie title"
> > - "The best movie title"
> > - ...
> >
> > Then the current Bloom filters can be used to find only the column
> > chunks/pages that match an exact title. For example you can use the bloom
> > filter to search for "The best movie title".
> >
> > It would be interesting to have *a bloom filter on the specific words*,
> > instead of using the entire value: in this way you can search the word
> > "best" in the "MovieTitle" column and find the titles that contain that
> > specific word in an efficient way.
> >
> > It would enable a sort of full-text search of keywords inside text
> > columns.
> > It would also allow predicate pushdown for searches based on keywords.
> >
> > Would it make sense to have such an addition? Is there any strategy already
> > used by Parquet for fast keyword searches inside text columns?
> >
> >
> > Best regards,
> > Marco Colli
> > AbstractBrain srls
> >
>
