Thanks Michael. I managed to translate the TermInSetQuery into Scala yesterday so now I can modify it in my codebase. This seems promising so far. Fingers crossed there's a way to maintain scores without basically converging to the BooleanQuery implementation. - AK
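For reference, the counter-per-doc idea from Michael's reply below can be sketched without any Lucene types. The postings arrays here are hypothetical stand-ins for what each term's DocIdSetIterator would produce; this is an illustration of the change, not the real TermInSetQuery internals:

```java
import java.util.Arrays;

public class CountingMatch {
    // Per-doc counters instead of a bitset: counts[d] = number of query terms doc d contains.
    static int[] counts(int[][] postings, int maxDoc) {
        int[] counts = new int[maxDoc];
        for (int[] docs : postings)
            for (int d : docs) counts[d]++;
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical postings: each inner array lists the docs containing one query hash.
        int[][] postings = {{0, 2, 5}, {2, 3, 5}, {1, 2, 5, 7}};

        // Current TermInSetQuery-style behavior: OR into a bitset,
        // so per-doc information reduces to matched / not-matched.
        boolean[] matched = new boolean[8];
        for (int[] docs : postings)
            for (int d : docs) matched[d] = true;

        // The modification: a counter per doc, so the score can reflect
        // how many query hashes each doc shares.
        System.out.println(Arrays.toString(counts(postings, 8)));
        // -> [1, 1, 3, 1, 0, 3, 0, 1]  (docs 2 and 5 share all three hashes)
    }
}
```

As Michael notes, the counter array costs more than the bitset (an int per doc rather than a bit, and no cheap word-at-a-time OR), which is where the speed loss would come from.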
On Wed, Jun 24, 2020 at 8:40 AM Michael Sokolov <msoko...@gmail.com> wrote:

> Yeah, that will require some changes, since what it does currently is to
> maintain a bitset and OR into it repeatedly (once for each term's docs).
> To maintain counts, you'd need a counter per doc (rather than a bit),
> and you might lose some of the speed...
>
> On Tue, Jun 23, 2020 at 8:52 PM Alex K <aklib...@gmail.com> wrote:
> >
> > The TermInSetQuery is definitely faster. Unfortunately it doesn't seem
> > to return the number of terms that matched in a given document. Rather
> > it just returns the boost value. I'll look into copying/modifying the
> > internals to return the number of matched terms.
> >
> > Thanks
> > - AK
> >
> > On Tue, Jun 23, 2020 at 3:17 PM Alex K <aklib...@gmail.com> wrote:
> >
> > > Hi Michael,
> > > Thanks for the quick response!
> > >
> > > I will look into the TermInSetQuery.
> > >
> > > My usage of "heap" might've been confusing.
> > > I'm using a FunctionScoreQuery from Elasticsearch.
> > > This gets instantiated with a Lucene query (in this case the boolean
> > > query as I described it) as well as a custom ScoreFunction object.
> > > The ScoreFunction exposes a single method that takes a doc id and the
> > > BooleanQuery score for that doc id, and returns another score.
> > > In that method I use a MinMaxPriorityQueue from the Guava library to
> > > maintain a fixed-capacity subset of the highest-scoring docs and
> > > evaluate exact similarity on them.
> > > Once the queue is at capacity, I just return 0 for any doc whose
> > > boolean query score is smaller than the min in the queue.
> > >
> > > But you can actually forget entirely that this ScoreFunction exists.
> > > It only contributes ~6% of the runtime.
> > > Even if I only use the BooleanQuery by itself, I still see the same
> > > behavior and bottlenecks.
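A minimal stdlib sketch of the fixed-capacity queue logic described above, substituting java.util.PriorityQueue for Guava's MinMaxPriorityQueue; the class name and the exact-similarity stand-in are illustrative, not the plugin's actual code:

```java
import java.util.PriorityQueue;

public class TopKGate {
    private final int capacity;
    private final PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap of kept scores

    TopKGate(int capacity) { this.capacity = capacity; }

    // Returns the (re-)score for a doc, or 0 once the queue is at capacity and the
    // boolean-query score cannot beat its current minimum.
    double score(double boolScore) {
        if (heap.size() < capacity) {
            heap.offer(boolScore);
            return exactSimilarity(boolScore);
        }
        if (boolScore <= heap.peek()) return 0; // below the min in the queue: skip
        heap.poll();                            // evict the current min
        heap.offer(boolScore);
        return exactSimilarity(boolScore);
    }

    // Stand-in for loading the vector from binary doc values and computing exact similarity.
    double exactSimilarity(double s) { return s; }

    public static void main(String[] args) {
        TopKGate gate = new TopKGate(2);
        System.out.println(gate.score(3));   // 3.0 (queue not yet full)
        System.out.println(gate.score(1));   // 1.0 (queue not yet full)
        System.out.println(gate.score(2));   // 2.0 (beats min=1, which is evicted)
        System.out.println(gate.score(1.5)); // 0.0 (below the new min=2)
    }
}
```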
> > > Thanks
> > > - AK
> > >
> > > On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov <msoko...@gmail.com> wrote:
> > >
> > >> You might consider using a TermInSetQuery in place of a BooleanQuery
> > >> for the hashes (since they are all in the same field).
> > >>
> > >> I don't really understand why you are seeing so much cost in the heap
> > >> - it sounds as if you have a single heap with mixed scores: those
> > >> generated by the BooleanQuery and those generated by the vector
> > >> scoring operation. Maybe you can comment a little more on the
> > >> interaction there - are there really two heaps? Do you override the
> > >> standard collector?
> > >>
> > >> On Tue, Jun 23, 2020 at 9:51 AM Alex K <aklib...@gmail.com> wrote:
> > >>
> > >> > Hello all,
> > >> >
> > >> > I'm working on an Elasticsearch plugin (using Lucene internally)
> > >> > that allows users to index numerical vectors and run exact and
> > >> > approximate k-nearest-neighbors similarity queries.
> > >> > I'd like to get some feedback about my usage of BooleanQueries and
> > >> > TermQueries, and see if there are any optimizations or performance
> > >> > tricks for my use case.
> > >> >
> > >> > An example use case for the plugin is reverse image search. A user
> > >> > can store vectors representing images and run a nearest-neighbors
> > >> > query to retrieve the 10 vectors with the smallest L2 distance to a
> > >> > query vector.
> > >> > More detailed documentation here: http://elastiknn.klibisz.com/
> > >> >
> > >> > The main method for indexing the vectors is based on Locality
> > >> > Sensitive Hashing
> > >> > <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
> > >> > The general pattern is:
> > >> >
> > >> > 1. When indexing a vector, apply a hash function to it, producing a
> > >> > set of discrete hashes. Usually there are anywhere from 100 to 1000
> > >> > hashes.
> > >> > Similar vectors are more likely to share hashes (i.e., similar
> > >> > vectors produce hash collisions).
> > >> > 2. Convert each hash to a byte array and store the byte array as a
> > >> > Lucene Term at a specific field.
> > >> > 3. Store the complete vector (i.e., floating point numbers) in a
> > >> > binary doc values field.
> > >> >
> > >> > In other words, I'm converting each vector into a bag of words,
> > >> > though the words have no semantic meaning.
> > >> >
> > >> > A query works as follows:
> > >> >
> > >> > 1. Given a query vector, apply the same hash function to produce a
> > >> > set of hashes.
> > >> > 2. Convert each hash to a byte array and create a Term.
> > >> > 3. Build and run a BooleanQuery with a clause for each Term. Each
> > >> > clause looks like this: `new BooleanClause(new ConstantScoreQuery(new
> > >> > TermQuery(new Term(field, new BytesRef(hashValue.toByteArray)))),
> > >> > BooleanClause.Occur.SHOULD)`.
> > >> > 4. As the BooleanQuery produces results, maintain a fixed-size heap
> > >> > of its scores. For any score exceeding the min in the heap, load its
> > >> > vector from the binary doc values, compute the exact similarity, and
> > >> > update the heap. Otherwise the vector gets a score of 0.
> > >> >
> > >> > When profiling my benchmarks with VisualVM, I've found the
> > >> > Elasticsearch search threads spend > 50% of the runtime in these two
> > >> > methods:
> > >> >
> > >> > - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> > >> > runtime)
> > >> > - org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8%
> > >> > of runtime)
> > >> >
> > >> > So the time seems to be dominated by collecting and ordering the
> > >> > results produced by the BooleanQuery from step 3 above.
> > >> > The exact similarity computation is only about 15% of the runtime.
> > >> > If I disable it entirely, I still see the same bottlenecks in
> > >> > VisualVM.
> > >> > Reducing the number of hashes yields roughly linear scaling (i.e.,
> > >> > 400 hashes take ~2x longer than 200 hashes).
> > >> >
> > >> > The use case seems different from text search in that there's no
> > >> > semantic meaning to the terms, their length, their ordering, their
> > >> > stems, etc.
> > >> > I basically just need the index to be a rudimentary HashMap, and I
> > >> > only care about the scores for the top k results.
> > >> > With that in mind, I've made the following optimizations:
> > >> >
> > >> > - Disabled tokenization on the FieldType (setTokenized(false))
> > >> > - Disabled norms on the FieldType (setOmitNorms(true))
> > >> > - Set similarity to BooleanSimilarity on the Elasticsearch
> > >> > MappedFieldType
> > >> > - Set index options to IndexOptions.DOCS
> > >> > - Used the MoreLikeThis heuristic to pick a subset of terms. This
> > >> > understandably only yields a speedup proportional to the number of
> > >> > discarded terms.
> > >> >
> > >> > I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
> > >> > The main query implementation is here:
> > >> > https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala
> > >> > The actual query that gets executed by Elasticsearch is instantiated
> > >> > on line 98:
> > >> > https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98
> > >> > It's in Scala, but all of the Java query classes should look
> > >> > familiar.
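The "rudimentary HashMap" framing above can be made concrete with a small stdlib sketch (no Lucene involved; the hash strings and doc ids are made up). It combines the inverted-index lookup with per-doc match counting, which is the only "score" this use case needs before exact re-ranking:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashIndex {
    // term (hash) -> docs containing it: the index used as a plain HashMap.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void index(int doc, List<String> hashes) {
        for (String h : hashes)
            postings.computeIfAbsent(h, k -> new ArrayList<>()).add(doc);
    }

    // Count, per doc, how many of the query's hashes it shares.
    Map<Integer, Integer> countMatches(List<String> queryHashes) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String h : queryHashes)
            for (int d : postings.getOrDefault(h, List.of()))
                counts.merge(d, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        HashIndex idx = new HashIndex();
        idx.index(0, List.of("h1", "h2", "h3"));
        idx.index(1, List.of("h2", "h4"));
        // Doc 0 shares two of the query's hashes, doc 1 shares one.
        System.out.println(idx.countMatches(List.of("h2", "h3", "h5")));
    }
}
```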
> > >> > Maybe there are some settings that I'm not aware of?
> > >> > Maybe I could optimize this by implementing a custom query or
> > >> > scorer?
> > >> > Maybe there's just no way to speed this up?
> > >> >
> > >> > I appreciate any input, examples, links, etc. :)
> > >> > Also, let me know if I can provide any additional details.
> > >> >
> > >> > Thanks,
> > >> > Alex Klibisz
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org