@Micah Does that mean that array-typed columns already get a Bloom filter
on each individual element?
I am using Apache Arrow in particular to work with Parquet files.
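
To make the question concrete, here is a minimal sketch of the idea discussed in this thread: tokenize each title and add every token to a Bloom filter, so that a keyword lookup can rule out data that cannot match. It is a toy, pure-Python filter and not Parquet's actual Bloom filter implementation. The names TokenBloomFilter and tokenize, the filter size, and the hash choice are all hypothetical and chosen only for illustration.

```python
import hashlib
import re


class TokenBloomFilter:
    """Toy Bloom filter for illustration; not Parquet's implementation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # num_bits and num_hashes are arbitrary illustrative values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, token):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


titles = [
    "The title of my movie",
    "Another movie title",
    "The best movie title",
]

# One filter over whole values, one over individual words.
value_filter = TokenBloomFilter()
word_filter = TokenBloomFilter()
for title in titles:
    value_filter.add(title)
    for token in tokenize(title):
        word_filter.add(token)

print(value_filter.might_contain("The best movie title"))
print(word_filter.might_contain("best"))
```

A filter built on whole values can only answer exact-title lookups, while the word-level filter can answer single-keyword lookups. Both can return false positives, so a positive answer still requires reading the data to confirm the match.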

On Wed, Jun 7, 2023, 16:00 Micah Kornfield <[email protected]> wrote:

> Hi Marco,
> Could you describe how your proposal differs from tokenizing the target
> string and storing the list of tokens in a column that has a bloom filter
> attached?  I think this should be supportable today by the format at least
> if not existing libraries.
>
> Thanks,
> Micah
>
> On Wednesday, June 7, 2023, Gang Wu <[email protected]> wrote:
>
> > Hi Marco,
> >
> > That sounds interesting!
> >
> > However, this requires the Parquet implementation to be able to tokenize
> > both the strings being written and the literals in the filters. The actual
> > efficiency depends on the data distribution. I am also concerned about the
> > possible explosion of distinct values introduced by splitting words, which
> > may result in a large Bloom filter.
> >
> > Have you tried any PoC to get a rough estimate of benefits in your use case?
> >
> > Best,
> > Gang
> >
> >
> >
> > On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I see that Parquet already supports Bloom filters.
> > >
> > > > As far as I understand, it currently uses them only on the entire value.
> > > >
> > > > For example, if I have a column "MovieTitle":
> > >
> > > - "The title of my movie"
> > > - "Another movie title"
> > > - "The best movie title"
> > > - ...
> > >
> > > > Then the current Bloom filters can be used to find only the column
> > > > chunks/pages that may contain an exact title. For example, you can use
> > > > the Bloom filter to search for "The best movie title".
> > >
> > > It would be interesting to have *a bloom filter on the specific words*,
> > > instead of using the entire value: in this way you can search the word
> > > "best" in the "MovieTitle" column and find the titles that contain that
> > > specific word in an efficient way.
> > >
> > > > It would enable a sort of full-text search of keywords inside text
> > > > columns. It would also allow predicate pushdown for searches based on
> > > > keywords.
> > >
> > > > Would it make sense to have such an addition? Is there any strategy
> > > > already used by Parquet for fast keyword searches inside text columns?
> > >
> > >
> > > Best regards,
> > > Marco Colli
> > > AbstractBrain srls
> > >
> >
>
