Hi Marco,

Could you describe how your proposal differs from tokenizing the target string and storing the list of tokens in a column that has a bloom filter attached? I think the format itself should be able to support this today, even if existing libraries do not.
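A minimal sketch of the tokenize-then-filter idea, in plain Python rather than against any Parquet library. Everything here is an assumption made for illustration: the names `tokenize`, `BloomFilter`, `build_chunk_filter` and `candidate_chunks` are hypothetical, the tokenizer is deliberately naive, and the toy filter uses salted SHA-256 digests, whereas Parquet's actual filter is a split-block Bloom filter over xxHash. Each inner list stands in for one column chunk of "MovieTitle".

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, ASCII only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BloomFilter:
    """Toy Bloom filter for illustration; NOT Parquet's split-block filter."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, token):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def build_chunk_filter(titles):
    """Build one filter over every token of every title in a chunk."""
    bloom = BloomFilter()
    for title in titles:
        for token in tokenize(title):
            bloom.add(token)
    return bloom


# Hypothetical column chunks of the "MovieTitle" column.
chunks = [
    ["The title of my movie", "Another movie title"],
    ["The best movie title"],
]
filters = [build_chunk_filter(chunk) for chunk in chunks]


def candidate_chunks(literal):
    """Return indices of chunks that may contain every token of the literal."""
    tokens = tokenize(literal)  # the filter literal is tokenized the same way
    return [
        index
        for index, bloom in enumerate(filters)
        if all(bloom.might_contain(token) for token in tokens)
    ]


print(candidate_chunks("best"))
```

Chunk 1 is always among the results for "best", because a Bloom filter has no false negatives; chunk 0 appears only on a false positive, so a reader would still have to scan the surviving chunks to confirm the match. The sketch also makes the two concerns raised in the thread visible: the writer and the reader must agree on the tokenizer, and the filter has to be sized for the number of distinct tokens rather than the number of distinct titles.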
Thanks,
Micah

On Wednesday, June 7, 2023, Gang Wu <[email protected]> wrote:

> Hi Marco,
>
> That sounds interesting!
>
> However, this requires the parquet implementation to be able to tokenize
> both the strings to write and the literals in the filters. The actual
> efficiency depends on the data distribution. I am also concerned about the
> possible explosion of distinct values introduced by splitting words, which
> may result in a large bloom filter.
>
> Have you tried any PoC to get a rough estimate of the benefits in your use
> case?
>
> Best,
> Gang
>
> On Tue, Jun 6, 2023 at 5:06 PM Marco Colli <[email protected]> wrote:
>
> > Hello,
> >
> > I see that Parquet already supports Bloom filters.
> >
> > To my understanding, it currently uses them only on the entire value.
> >
> > For example, if I have a column "MovieTitle":
> >
> > - "The title of my movie"
> > - "Another movie title"
> > - "The best movie title"
> > - ...
> >
> > Then the current Bloom filters can be used to find only the column
> > chunks/pages that match an exact title. For example, you can use the
> > bloom filter to search for "The best movie title".
> >
> > It would be interesting to have *a bloom filter on the specific words*,
> > instead of using the entire value: in this way you can search for the
> > word "best" in the "MovieTitle" column and find the titles that contain
> > that specific word in an efficient way.
> >
> > It would enable a sort of full-text search of keywords inside text
> > columns. It would also allow predicate pushdown for searches based on
> > keywords.
> >
> > Would it make sense to have such an addition? Is there any strategy
> > already used by Parquet for fast keyword searches inside text columns?
> >
> > Best regards,
> > Marco Colli
> > AbstractBrain srls
