Hi Marco,

That sounds interesting!

However, this requires the Parquet implementation to tokenize both the strings being written and the literals in the filters. The actual efficiency depends on the data distribution. I am also concerned about the possible explosion of distinct values introduced by splitting values into words, which may result in a large Bloom filter. Have you tried a PoC to get a rough estimate of the benefits in your use case?

Best,
Gang

On Tue, Jun 6, 2023 at 5:06 PM Marco Colli wrote:

> Hello,
>
> I see that Parquet already supports Bloom filters.
>
> As I understand it, it currently uses them only on the entire value.
>
> For example, if I have a column "MovieTitle":
>
> - "The title of my movie"
> - "Another movie title"
> - "The best movie title"
> - ...
>
> Then the current Bloom filters can be used to find only the column
> chunks/pages that match an exact title. For example, you can use the Bloom
> filter to search for "The best movie title".
>
> It would be interesting to have *a bloom filter on the specific words*,
> instead of using the entire value: in this way you can search for the word
> "best" in the "MovieTitle" column and find the titles that contain that
> specific word in an efficient way.
>
> It would enable a sort of full-text search of keywords inside text columns.
> It would also allow predicate pushdown for searches based on keywords.
>
> Would it make sense to have such an addition? Is there any strategy already
> used by Parquet for fast keyword searches inside text columns?
>
>
> Best regards,
> Marco Colli
> AbstractBrain srls
>
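
To make the trade-off discussed in this thread concrete, here is a minimal sketch in Python of a whole-value filter next to a per-word filter over the three example titles. Everything in it is an assumption for illustration: `BloomFilter` is a toy bit-set filter (not Parquet's split-block Bloom filter), and `tokenize` and `chunk_might_contain_word` are hypothetical helpers, not part of any Parquet API. It shows that the same tokenizer has to run on both the written values and the query literal, and that the number of distinct entries grows once values are split into words.

```python
import hashlib
import re


class BloomFilter:
    """Toy Bloom filter for illustration; NOT Parquet's on-disk format."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


titles = ["The title of my movie", "Another movie title", "The best movie title"]

whole_value_filter = BloomFilter()  # what the thread says exists today
token_filter = BloomFilter()        # the proposed per-word variant
distinct_tokens = set()

for title in titles:
    whole_value_filter.add(title)
    for token in tokenize(title):
        token_filter.add(token)
        distinct_tokens.add(token)


def chunk_might_contain_word(word):
    # The query literal goes through the same tokenizer as the written data.
    return all(token_filter.might_contain(t) for t in tokenize(word))


print(len(titles), "whole values vs", len(distinct_tokens), "distinct tokens")
print("exact title:", whole_value_filter.might_contain("The best movie title"))
print("keyword 'best':", chunk_might_contain_word("best"))
```

On real data the ratio of distinct tokens to distinct values depends on the vocabulary and the tokenizer, which is why a PoC on the actual column would be the more reliable estimate.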
