Hi,

This would require standardizing on a specific tokenization algorithm,
right? I'm not sure it's a good idea to add such complexity to the
Parquet spec (the tokenization might need to be language-specific
and/or corpus-specific).
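
To make the concern concrete, here is a rough sketch in Python. Everything
in it is hypothetical (the `TinyBloom` class and both tokenizers are made up
for illustration and have nothing to do with the Parquet spec or its split
block Bloom filter). It only shows that two reasonable tokenizers produce
different tokens for the same string, so the writer and every reader would
have to agree on exactly one of them:

```python
import hashlib
import re


class TinyBloom:
    """Toy Bloom filter for illustration only (not Parquet's format)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit set

    def _positions(self, token):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def whitespace_tokens(text):
    # Hypothetical tokenizer A: split on whitespace, keep case and punctuation.
    return text.split()


def normalized_tokens(text):
    # Hypothetical tokenizer B: lowercase, then keep runs of word characters.
    return re.findall(r"\w+", text.lower())


titles = ["The title of my movie", "Another movie title", "The best movie title"]

# The writer indexes every word of every value using tokenizer A.
bloom = TinyBloom()
for title in titles:
    for token in whitespace_tokens(title):
        bloom.add(token)

# A reader using the same tokenizer can probe for a single word.
print(bloom.might_contain("best"))

# The two tokenizers disagree on the same input.
print(whitespace_tokens("The Best-Movie title"))
print(normalized_tokens("The Best-Movie title"))
```

If a reader tokenized its filter literal with tokenizer B while the file was
written with tokenizer A, a probe for "the" would most likely report the word
as absent even though "The" is in the data, and the reader would wrongly skip
the column chunk.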

I wonder if it would be more productive to try and find ways to build
e.g. a Lucene index over Parquet columns (perhaps it's already
possible?).

Regards

Antoine.



On Wed, 7 Jun 2023 18:01:32 +0800
Gang Wu <[email protected]> wrote:
> Hi Marco,
> 
> That sounds interesting!
> 
> However, this requires the parquet implementation to be able to tokenize
> both strings to write and literals in the filters. The actual efficiency
> depends on the data distribution. I am also concerned with the possible
> explosion of distinct values introduced by splitting words, which may
> result in a large bloom filter.
> 
> Have you tried any PoC to get a rough estimate of benefits in your use case?
> 
> Best,
> Gang
> 
> 
> 
> On Tue, Jun 6, 2023 at 5:06 PM Marco Colli 
> <[email protected]> wrote:
> 
> > Hello,
> >
> > I see that Parquet already supports Bloom filters.
> >
> > To my understanding, it currently uses them only on the entire value.
> >
> > For example, if I have a column "MovieTitle":
> >
> > - "The title of my movie"
> > - "Another movie title"
> > - "The best movie title"
> > - ...
> >
> > Then the current Bloom filters can be used to find only the column
> > chunks/pages that match an exact title. For example you can use the bloom
> > filter to search for "The best movie title".
> >
> > It would be interesting to have *a bloom filter on the specific words*,
> > instead of using the entire value: in this way you can search the word
> > "best" in the "MovieTitle" column and find the titles that contain that
> > specific word in an efficient way.
> >
> > It would enable a sort of full-text search of keywords inside text columns.
> > It would also allow predicate pushdown for searches based on keywords.
> >
> > Would it make sense to have such an addition? Is there any strategy already
> > used by Parquet for fast keyword searches inside text columns?
> >
> >
> > Best regards,
> > Marco Colli
> > AbstractBrain srls
> >  
> 


