Thanks Michael. I managed to translate the TermInSetQuery into Scala
yesterday so now I can modify it in my codebase. This seems promising so
far. Fingers crossed there's a way to maintain scores without basically
converging to the BooleanQuery implementation.
- AK

On Wed, Jun 24, 2020 at 8:40 AM Michael Sokolov <msoko...@gmail.com> wrote:

> Yeah, that will require some changes, since what it currently does is
> maintain a bitset and OR each term's postings into it (once per term).
> To maintain counts, you'd need a counter per doc (rather than a bit),
> and you might lose some of the speed...
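>
> A rough sketch of that counter-per-doc idea in Scala (just an
> illustration, not Lucene's actual code; it works on a single segment's
> LeafReader and ignores deleted docs and scoring entirely):
>
>     import org.apache.lucene.index.{LeafReader, PostingsEnum, TermsEnum}
>     import org.apache.lucene.search.DocIdSetIterator
>     import org.apache.lucene.util.BytesRef
>
>     // One counter per doc instead of one bit per doc.
>     def countMatches(reader: LeafReader, field: String, hashes: Seq[BytesRef]): Array[Int] = {
>       val counts = new Array[Int](reader.maxDoc())
>       val terms = reader.terms(field)
>       if (terms != null) {
>         val termsEnum: TermsEnum = terms.iterator()
>         var postings: PostingsEnum = null
>         hashes.foreach { h =>
>           if (termsEnum.seekExact(h)) {
>             postings = termsEnum.postings(postings, PostingsEnum.NONE)
>             var doc = postings.nextDoc()
>             while (doc != DocIdSetIterator.NO_MORE_DOCS) {
>               counts(doc) += 1 // increment instead of setting a bit
>               doc = postings.nextDoc()
>             }
>           }
>         }
>       }
>       counts
>     }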
>
> On Tue, Jun 23, 2020 at 8:52 PM Alex K <aklib...@gmail.com> wrote:
> >
> > The TermInSetQuery is definitely faster. Unfortunately it doesn't seem
> > to return the number of terms that matched in a given document. Rather
> > it just returns the boost value. I'll look into copying/modifying the
> > internals to return the number of matched terms.
> >
> > Thanks
> > - AK
> >
> > On Tue, Jun 23, 2020 at 3:17 PM Alex K <aklib...@gmail.com> wrote:
> >
> > > Hi Michael,
> > > Thanks for the quick response!
> > >
> > > I will look into the TermInSetQuery.
> > >
> > > My usage of "heap" might've been confusing.
> > > I'm using a FunctionScoreQuery from Elasticsearch.
> > > This gets instantiated with a Lucene query, in this case the boolean
> > > query as I described it, as well as a custom ScoreFunction object.
> > > The ScoreFunction exposes a single method that takes a doc id and the
> > > BooleanQuery score for that doc id, and returns another score.
> > > In that method I use a MinMaxPriorityQueue from the Guava library to
> > > maintain a fixed-capacity subset of the highest-scoring docs and
> > > evaluate exact similarity on them.
> > > Once the queue is at capacity, I just return 0 for any docs that had a
> > > boolean query score smaller than the min in the queue.
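> > >
> > > Roughly, the logic in that method looks like this (a simplified
> > > sketch; `k` and `exactSimilarity` are placeholders for the queue
> > > capacity and the exact-similarity computation against doc values):
> > >
> > >     import com.google.common.collect.MinMaxPriorityQueue
> > >     import java.util.Comparator
> > >
> > >     class TopKScoreFunction(k: Int, exactSimilarity: Int => Double) {
> > >       // Reverse order so the queue keeps the k highest boolean-query
> > >       // scores; peekLast() is then the smallest retained score.
> > >       private val queue: MinMaxPriorityQueue[java.lang.Float] =
> > >         MinMaxPriorityQueue
> > >           .orderedBy(Comparator.reverseOrder[java.lang.Float]())
> > >           .maximumSize(k)
> > >           .create()
> > >
> > >       def score(docId: Int, boolScore: Float): Double =
> > >         if (queue.size() < k || boolScore > queue.peekLast()) {
> > >           queue.offer(boolScore)
> > >           exactSimilarity(docId) // load vector from doc values, compute exact similarity
> > >         } else 0.0
> > >     }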
> > >
> > > But you can actually forget entirely that this ScoreFunction exists. It
> > > only contributes ~6% of the runtime.
> > > Even if I only use the BooleanQuery by itself, I still see the same
> > > behavior and bottlenecks.
> > >
> > > Thanks
> > > - AK
> > >
> > >
> > > On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov <msoko...@gmail.com>
> > > wrote:
> > >
> > >> You might consider using a TermInSetQuery in place of a BooleanQuery
> > >> for the hashes (since they are all in the same field).
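> > >>
> > >> For example, roughly (a sketch; `field` and `hashes` stand in for the
> > >> hashes field name and the query vector's hashes as byte arrays):
> > >>
> > >>     import org.apache.lucene.search.TermInSetQuery
> > >>     import org.apache.lucene.util.BytesRef
> > >>
> > >>     // A single query over all hash terms in one field, instead of
> > >>     // one SHOULD clause per term.
> > >>     def termInSetQuery(field: String, hashes: Seq[Array[Byte]]): TermInSetQuery = {
> > >>       val terms = new java.util.ArrayList[BytesRef]()
> > >>       hashes.foreach(h => terms.add(new BytesRef(h)))
> > >>       new TermInSetQuery(field, terms)
> > >>     }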
> > >>
> > >> I don't really understand why you are seeing so much cost in the heap
> > >> - it sounds as if you have a single heap with mixed scores - those
> > >> generated by the BooleanQuery and those generated by the vector
> > >> scoring operation. Maybe you can comment a little more on the
> > >> interaction there - are there really two heaps? Do you override the
> > >> standard collector?
> > >>
> > >> On Tue, Jun 23, 2020 at 9:51 AM Alex K <aklib...@gmail.com> wrote:
> > >> >
> > >> > Hello all,
> > >> >
> > >> > I'm working on an Elasticsearch plugin (using Lucene internally)
> > >> > that allows users to index numerical vectors and run exact and
> > >> > approximate k-nearest-neighbors similarity queries.
> > >> > I'd like to get some feedback about my usage of BooleanQueries and
> > >> > TermQueries, and see if there are any optimizations or performance
> > >> > tricks for my use case.
> > >> >
> > >> > An example use case for the plugin is reverse image search. A user
> > >> > can store vectors representing images and run a nearest-neighbors
> > >> > query to retrieve the 10 vectors with the smallest L2 distance to a
> > >> > query vector.
> > >> > More detailed documentation here: http://elastiknn.klibisz.com/
> > >> >
> > >> > The main method for indexing the vectors is based on Locality
> > >> > Sensitive Hashing
> > >> > <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
> > >> > The general pattern is:
> > >> >
> > >> >    1. When indexing a vector, apply a hash function to it, producing
> > >> >    a set of discrete hashes. Usually there are anywhere from 100 to
> > >> >    1000 hashes. Similar vectors are more likely to share hashes
> > >> >    (i.e., similar vectors produce hash collisions).
> > >> >    2. Convert each hash to a byte array and store the byte array as
> > >> >    a Lucene Term at a specific field.
> > >> >    3. Store the complete vector (i.e. floating point numbers) in a
> > >> >    binary doc values field (see the sketch below).
> > >> >
> > >> > In other words, I'm converting each vector into a bag of words,
> > >> > though the words have no semantic meaning.
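> > >> >
> > >> > A rough sketch of that indexing step (`hashVector` and
> > >> > `vectorToBytes` are placeholders for the LSH function and the
> > >> > vector serialization):
> > >> >
> > >> >     import org.apache.lucene.document.{BinaryDocValuesField, Document, Field, FieldType}
> > >> >     import org.apache.lucene.index.IndexOptions
> > >> >     import org.apache.lucene.util.BytesRef
> > >> >
> > >> >     // Hash terms are opaque byte arrays; the raw vector goes into
> > >> >     // binary doc values so it can be re-read for exact scoring.
> > >> >     val hashesType = new FieldType()
> > >> >     hashesType.setTokenized(false)
> > >> >     hashesType.setOmitNorms(true)
> > >> >     hashesType.setIndexOptions(IndexOptions.DOCS)
> > >> >     hashesType.freeze()
> > >> >
> > >> >     def indexVector(doc: Document, field: String, vec: Array[Float],
> > >> >                     hashVector: Array[Float] => Seq[Array[Byte]],
> > >> >                     vectorToBytes: Array[Float] => Array[Byte]): Unit = {
> > >> >       hashVector(vec).foreach(h => doc.add(new Field(field, new BytesRef(h), hashesType)))
> > >> >       doc.add(new BinaryDocValuesField(field + "_vec", new BytesRef(vectorToBytes(vec))))
> > >> >     }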
> > >> >
> > >> > A query works as follows:
> > >> >
> > >> >    1. Given a query vector, apply the same hash function to produce
> > >> >    a set of hashes.
> > >> >    2. Convert each hash to a byte array and create a Term.
> > >> >    3. Build and run a BooleanQuery with a clause for each Term. Each
> > >> >    clause looks like this: `new BooleanClause(new
> > >> >    ConstantScoreQuery(new TermQuery(new Term(field, new
> > >> >    BytesRef(hashValue.toByteArray)))), BooleanClause.Occur.SHOULD)`
> > >> >    (see the sketch after this list).
> > >> >    4. As the BooleanQuery produces results, maintain a fixed-size
> > >> >    heap of its scores. For any score exceeding the min in the heap,
> > >> >    load its vector from the binary doc values, compute the exact
> > >> >    similarity, and update the heap. Otherwise the vector gets a
> > >> >    score of 0.
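> > >> >
> > >> > The query construction from step 3, roughly (`queryHashes` is a
> > >> > placeholder for the query vector's hashes):
> > >> >
> > >> >     import org.apache.lucene.index.Term
> > >> >     import org.apache.lucene.search.{BooleanClause, BooleanQuery, ConstantScoreQuery, TermQuery}
> > >> >     import org.apache.lucene.util.BytesRef
> > >> >
> > >> >     // One constant-scoring SHOULD clause per hash, so a doc's score
> > >> >     // is roughly the number of hashes it matches.
> > >> >     def buildQuery(field: String, queryHashes: Seq[Array[Byte]]): BooleanQuery = {
> > >> >       val builder = new BooleanQuery.Builder()
> > >> >       queryHashes.foreach { h =>
> > >> >         val tq = new TermQuery(new Term(field, new BytesRef(h)))
> > >> >         builder.add(new BooleanClause(new ConstantScoreQuery(tq), BooleanClause.Occur.SHOULD))
> > >> >       }
> > >> >       builder.build()
> > >> >     }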
> > >> >
> > >> > When profiling my benchmarks with VisualVM, I've found the
> > >> > Elasticsearch search threads spend > 50% of the runtime in these
> > >> > two methods:
> > >> >
> > >> >    - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> > >> >    runtime)
> > >> >    - org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc
> > >> >    (~8% of runtime)
> > >> >
> > >> > So the time seems to be dominated by collecting and ordering the
> > >> > results produced by the BooleanQuery from step 3 above.
> > >> > The exact similarity computation is only about 15% of the runtime.
> > >> > If I disable it entirely, I still see the same bottlenecks in
> > >> > VisualVM.
> > >> > Reducing the number of hashes yields roughly linear scaling (i.e.,
> > >> > 400 hashes take ~2x longer than 200 hashes).
> > >> >
> > >> > The use case seems different from text search in that there's no
> > >> > semantic meaning to the terms, their length, their ordering, their
> > >> > stems, etc.
> > >> > I basically just need the index to be a rudimentary HashMap, and I
> > >> > only care about the scores for the top k results.
> > >> > With that in mind, I've made the following optimizations:
> > >> >
> > >> >    - Disabled tokenization on the FieldType (setTokenized(false))
> > >> >    - Disabled norms on the FieldType (setOmitNorms(true))
> > >> >    - Set similarity to BooleanSimilarity on the Elasticsearch
> > >> >    MappedFieldType (see the sketch below)
> > >> >    - Set index options to IndexOptions.DOCS
> > >> >    - Used the MoreLikeThis heuristic to pick a subset of terms. This
> > >> >    understandably only yields a speedup proportional to the number
> > >> >    of discarded terms.
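> > >> >
> > >> > For reference, in plain Lucene terms (outside the Elasticsearch
> > >> > MappedFieldType) the BooleanSimilarity setting corresponds to
> > >> > roughly this:
> > >> >
> > >> >     import org.apache.lucene.search.IndexSearcher
> > >> >     import org.apache.lucene.search.similarities.BooleanSimilarity
> > >> >
> > >> >     // Constant per-term scores, no TF/IDF or length normalization.
> > >> >     def useBooleanSimilarity(searcher: IndexSearcher): Unit =
> > >> >       searcher.setSimilarity(new BooleanSimilarity())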
> > >> >
> > >> > I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
> > >> > The main query implementation is here:
> > >> > https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala
> > >> > The actual query that gets executed by Elasticsearch is instantiated
> > >> > on line 98:
> > >> > https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98
> > >> > It's in Scala, but all of the Java query classes should look
> > >> > familiar.
> > >> >
> > >> > Maybe there are some settings that I'm not aware of?
> > >> > Maybe I could optimize this by implementing a custom query or
> > >> > scorer?
> > >> > Maybe there's just no way to speed this up?
> > >> >
> > >> > I appreciate any input, examples, links, etc. :)
> > >> > Also, let me know if I can provide any additional details.
> > >> >
> > >> > Thanks,
> > >> > Alex Klibisz
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >>
> > >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
