Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Alex K
Hi Tommaso, thanks for the input and links! I'll add your paper to my literature review. So far I've seen very promising results from modifying the TermInSetQuery. It was pretty simple to keep a map of `doc id -> matched term count` and then only evaluate the exact similarity on the top k doc ids.
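The counting approach described above can be sketched roughly as follows. This is plain Python over an in-memory inverted index, not the plugin's code and not the Lucene API: `postings` (term to list of doc ids), `exact_similarity` (a callable taking a doc id), and both function names are hypothetical stand-ins chosen for illustration.

```python
from collections import Counter
import heapq


def count_matched_terms(postings, query_terms):
    """Return a Counter mapping doc id -> number of query terms it contains."""
    counts = Counter()
    for term in query_terms:
        # Each posting list holds the doc ids that contain this term.
        for doc_id in postings.get(term, ()):
            counts[doc_id] += 1
    return counts


def top_k_by_exact_similarity(postings, query_terms, exact_similarity, k):
    """Pick the k docs with the most matched terms, then rank them exactly."""
    counts = count_matched_terms(postings, query_terms)
    # Cheap candidate selection: highest matched-term count first,
    # ties broken in favour of the lower doc id.
    candidates = heapq.nlargest(k, counts, key=lambda d: (counts[d], -d))
    # The expensive exact similarity runs on these k candidates only.
    scored = [(doc_id, exact_similarity(doc_id)) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The point of the two-stage design is that the exact similarity is computed k times per query rather than once per matching document; whether the matched-term count is a good enough proxy for candidate selection depends on the hash scheme.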

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Toke Eskildsen
On Wed, 2020-06-24 at 13:46 -0400, Alex K wrote:
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset).
Without a clearly defined use case, I would say that the sequential scan approach is not the right one: As these things go, someone will

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Tommaso Teofili
Hi Alex, I worked on a similar problem directly on Lucene (within the Anserini toolkit) using LSH fingerprints of tokenized feature vector values. You can find code at [1] and some information on the Anserini documentation page [2] and in a short preprint [3]. As a side note my current thinking is

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Alex K
Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem space! My implementation isn't specific to any particular dataset or access pattern (i.e. infinite vs. subset). So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular similarities with LSH variants for all but

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Alex K
Thanks Michael. I managed to translate the TermInSetQuery into Scala yesterday so now I can modify it in my codebase. This seems promising so far. Fingers crossed there's a way to maintain scores without basically converging to the BooleanQuery implementation.
- AK
On Wed, Jun 24, 2020 at 8:40 AM

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Toke Eskildsen
On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> I'm working on an Elasticsearch plugin (using Lucene internally) that
> allows users to index numerical vectors and run exact and approximate
> k-nearest-neighbors similarity queries.
Quite a coincidence. I'm looking into the same thing :-)
> 1

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Michael Sokolov
Yeah, that will require some changes, since what it does currently is maintain a bitset and OR into it repeatedly (once for each term's docs). To maintain counts, you'd need a counter per doc (rather than a bit), and you might lose some of the speed...
On Tue, Jun 23, 2020 at 8:52 PM Alex K wro
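To illustrate the trade-off described here, the following is a minimal sketch in plain Python, not Lucene's implementation: `postings`, `max_doc`, and both function names are hypothetical. The first function keeps one bit per document and can only answer "did any term match"; the second keeps one counter per document, which costs more memory and work per posting but preserves how many terms matched.

```python
def matching_docs_bitset(postings, query_terms):
    """One bit per doc: records whether a doc matched, not how often."""
    bits = 0  # a Python int used as an arbitrarily wide bitset
    for term in query_terms:
        for doc_id in postings.get(term, ()):
            bits |= 1 << doc_id  # OR in this term's docs
    return bits


def matching_docs_counts(postings, query_terms, max_doc):
    """One counter per doc: records how many query terms each doc matched."""
    counts = [0] * max_doc
    for term in query_terms:
        for doc_id in postings.get(term, ()):
            counts[doc_id] += 1
    return counts
```

With the bitset, a document matched by three terms is indistinguishable from one matched by a single term, which is why counting requires changing the data structure rather than just reading it differently.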

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
The TermInSetQuery is definitely faster. Unfortunately it doesn't seem to return the number of terms that matched in a given document. Rather, it just returns the boost value. I'll look into copying/modifying the internals to return the number of matched terms.
Thanks - AK
On Tue, Jun 23, 2020 at

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
Hi Michael, Thanks for the quick response! I will look into the TermInSetQuery. My usage of "heap" might've been confusing. I'm using a FunctionScoreQuery from Elasticsearch. This gets instantiated with a Lucene query, in this case the boolean query as I described it, as well as a custom ScoreFun

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Michael Sokolov
You might consider using a TermInSetQuery in place of a BooleanQuery for the hashes (since they are all in the same field). I don't really understand why you are seeing so much cost in the heap - it sounds as if you have a single heap with mixed scores - those generated by the BooleanQuery and t

Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
Hello all, I'm working on an Elasticsearch plugin (using Lucene internally) that allows users to index numerical vectors and run exact and approximate k-nearest-neighbors similarity queries. I'd like to get some feedback about my usage of BooleanQueries and TermQueries, and see if there are any op