Hi,

This would require standardizing on a specific tokenization algorithm,
right? I'm not sure it's a good idea to add such complexity to the
Parquet spec (the tokenization might need to be language-specific
and/or corpus-specific).

I wonder if it would be more productive to try and find ways to build
e.g. a Lucene index over Parquet columns (perhaps it's already
possible?).

Regards

Antoine.


On Wed, 7 Jun 2023 18:01:32 +0800
Gang Wu <[email protected]> wrote:
> Hi Marco,
>
> That sounds interesting!
>
> However, this requires the parquet implementation to be able to
> tokenize both strings to write and literals in the filters. The actual
> efficiency depends on the data distribution. I am also concerned with
> the possible explosion of distinct values introduced by splitting
> words, which may result in a large bloom filter.
>
> Have you tried any PoC to get a rough estimate of benefits in your use
> case?
>
> Best,
> Gang
>
> On Tue, Jun 6, 2023 at 5:06 PM Marco Colli
> <[email protected]> wrote:
>
> > Hello,
> >
> > I see that Parquet already supports Bloom filters.
> >
> > For my understanding, it currently uses them only on the entire value.
> >
> > For example, if I have a column "MovieTitle":
> >
> > - "The title of my movie"
> > - "Another movie title"
> > - "The best movie title"
> > - ...
> >
> > Then the current Bloom filters can be used to find only the column
> > chunks/pages that match an exact title. For example you can use the
> > bloom filter to search for "The best movie title".
> >
> > It would be interesting to have *a bloom filter on the specific words*,
> > instead of using the entire value: in this way you can search the word
> > "best" in the "MovieTitle" column and find the titles that contain that
> > specific word in an efficient way.
> >
> > It would enable a sort of full-text search of keywords inside text
> > columns. It would also allow predicate pushdown for searches based on
> > keywords.
> >
> > Would it make sense to have such an addition? Is there any strategy
> > already used by Parquet for fast keyword searches inside text columns?
> >
> > Best regards,
> > Marco Colli
> > AbstractBrain srls
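
For reference, the word-level idea discussed in this thread could be sketched roughly as follows. This is only an illustration under stated assumptions, not Parquet's actual Bloom filter (the spec uses a split-block design with xxHash) and not an existing Parquet API. The names BloomFilter, tokenize and build_word_filter, the SHA-256-based hashing and the lowercase alphanumeric tokenizer are all hypothetical choices made for the example. Writer and reader must agree on the tokenizer, which is exactly the standardization concern raised above.

```python
import hashlib
import re


class BloomFilter:
    """Toy Bloom filter (hypothetical; not Parquet's split-block filter)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def tokenize(text):
    # Assumed tokenizer: lowercase, split on non-alphanumeric characters.
    # A real one may need to be language-specific and/or corpus-specific.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_filter(values, num_bits=1024, num_hashes=3):
    # One filter per column chunk/page, fed with every word of every value.
    bloom = BloomFilter(num_bits, num_hashes)
    for value in values:
        for token in tokenize(value):
            bloom.add(token)
    return bloom


titles = [
    "The title of my movie",
    "Another movie title",
    "The best movie title",
]
chunk_filter = build_word_filter(titles)

# The search literal goes through the same tokenizer as the written values.
for word in tokenize("best"):
    print(word, chunk_filter.might_contain(word))
```

A reader could skip a chunk whenever might_contain returns False for any word of the search literal. A word that is not in the chunk will usually, but not always, return False, since Bloom filters allow false positives. The number of distinct tokens per chunk drives the filter size, which is the explosion of distinct values mentioned above.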
