Hi Marco,

This is an exciting idea, and it is good to see new use cases for Parquet
being explored. As an open community, we always welcome new ideas and
innovations like yours. I encourage you to go deeper and broader with this
idea and come up with a proposal and POC. Generative AI is now a reality,
so in addition to keyword search, you could consider things like OpenAI
embeddings. Perhaps Parquet filters could later match on the closeness of
two embedding vectors.

That said, the other comments are also valid: Parquet is a strict file
format, and anything we add needs to be standardized. So we look forward to
your proposal and POC. If you would like to discuss it at this week's sync
meeting, you are more than welcome. I have added you.
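
As a possible starting point for the POC, here is a minimal sketch of the
word-level idea from the original message. It is only an illustration: the
BloomFilter class is a toy built on SHA-256 (not Parquet's actual Bloom
filter implementation), and tokenize() assumes lowercase whitespace
splitting, which is exactly the kind of choice a proposal would have to
standardize.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for illustration; not Parquet's implementation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def tokenize(text):
    # Assumption: lowercase + whitespace split. A real proposal would need
    # to pin down the tokenizer (language, punctuation, stemming, ...).
    return text.lower().split()


titles = ["The title of my movie", "Another movie title", "The best movie title"]

whole_value = BloomFilter()  # current behaviour: one entry per full value
per_word = BloomFilter()     # proposed behaviour: one entry per token
distinct_tokens = set()

for title in titles:
    whole_value.add(title)
    for word in tokenize(title):
        per_word.add(word)
        distinct_tokens.add(word)

print(whole_value.might_contain("The best movie title"))  # exact-match lookup
print(per_word.might_contain("best"))                     # keyword lookup
print(len(distinct_tokens), "distinct tokens for", len(titles), "values")
```

The distinct token count is the number to watch for the size concern
raised below: the filter has to hold every distinct word rather than every
distinct value, and the search literal must go through the same tokenizer
as the written strings.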

Xinli Shang

On Thu, Jun 15, 2023 at 4:38 AM Antoine Pitrou <[email protected]> wrote:

>
> Hi,
>
> This would require standardizing on a specific tokenization algorithm,
> right? I'm not sure it's a good idea to add such complexity to the
> Parquet spec (the tokenization might need to be language-specific
> and/or corpus-specific).
>
> I wonder if it would be more productive to try and find ways to build
> e.g. a Lucene index over Parquet columns (perhaps it's already
> possible?).
>
> Regards
>
> Antoine.
>
>
>
> On Wed, 7 Jun 2023 18:01:32 +0800
> Gang Wu <[email protected]> wrote:
> > Hi Marco,
> >
> > That sounds interesting!
> >
> > However, this requires the parquet implementation to be able to tokenize
> > both strings to write and literals in the filters. The actual efficiency
> > depends on the data distribution. I am also concerned with the possible
> > explosion of distinct values introduced by splitting words, which may
> > result in a large bloom filter.
> >
> > Have you tried any PoC to get a rough estimate of benefits in your use
> > case?
> >
> > Best,
> > Gang
> >
> >
> >
> > On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I see that Parquet already supports Bloom filters.
> > >
> > > As I understand it, it currently uses them only on the entire value.
> > >
> > > For example, if I have a column "MovieTitle":
> > >
> > > - "The title of my movie"
> > > - "Another movie title"
> > > - "The best movie title"
> > > - ...
> > >
> > > Then the current Bloom filters can be used to find only the column
> > > chunks/pages that match an exact title. For example you can use the
> > > bloom filter to search for "The best movie title".
> > >
> > > It would be interesting to have *a bloom filter on the specific words*,
> > > instead of using the entire value: in this way you can search the word
> > > "best" in the "MovieTitle" column and find the titles that contain that
> > > specific word in an efficient way.
> > >
> > > It would enable a sort of full-text search of keywords inside text
> > > columns. It would also allow predicate pushdown for searches based on
> > > keywords.
> > >
> > > Would it make sense to have such an addition? Is there any strategy
> > > already used by Parquet for fast keyword searches inside text columns?
> > >
> > >
> > > Best regards,
> > > Marco Colli
> > > AbstractBrain srls
> > >
> >
>
>
>
>

-- 
Xinli Shang
