Hi Tommaso, thanks for the input and links! I'll add your paper to my
literature review.
So far I've seen very promising results from modifying the TermInSetQuery.
It was pretty simple to keep a map of `doc id -> matched term count` and
then only evaluate the exact similarity on the top k doc ids.
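
A minimal sketch of the counting idea described above, not the plugin's actual code. It assumes an in-memory `postings` dict standing in for the hash-term index and uses Jaccard as the exact similarity; all names here (`matched_term_counts`, `rerank_top_k`, `postings`, `vectors`) are hypothetical.

```python
import heapq


def matched_term_counts(postings, query_terms):
    """Build the `doc id -> matched term count` map for the query's terms."""
    counts = {}
    for term in query_terms:
        for doc_id in postings.get(term, ()):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return counts


def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank_top_k(postings, vectors, query_terms, query_vec, k):
    """Pick the k docs with the most matched terms, then score only those exactly."""
    counts = matched_term_counts(postings, query_terms)
    # Ties on count are broken by the lower doc id (an arbitrary choice here).
    top = heapq.nlargest(k, counts.items(), key=lambda kv: (kv[1], -kv[0]))
    scored = [(doc_id, jaccard(query_vec, vectors[doc_id])) for doc_id, _ in top]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: hash term -> doc ids containing it, and doc id -> stored vector (as a set).
postings = {"h1": [0, 1, 2], "h2": [1, 2], "h3": [2, 3]}
vectors = {0: {1, 2}, 1: {1, 2, 3}, 2: {1, 2, 3, 4}, 3: {9}}
result = rerank_top_k(postings, vectors, ["h1", "h2", "h3"], {1, 2, 3, 4}, k=2)
```

The exact similarity is computed for only k candidates, so its cost no longer grows with the number of documents that matched at least one term.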
On Wed, 2020-06-24 at 13:46 -0400, Alex K wrote:
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset).
Without a clearly defined use case, I would say that the sequential
scan approach is not the right one: As these things go, someone will
Hi Alex,
I worked on a similar problem directly in Lucene (within the Anserini
toolkit) using LSH fingerprints of tokenized feature vector values.
You can find code at [1] and some information on the Anserini documentation
page [2] and in a short preprint [3].
As a side note my current thinking is
Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem
space!
My implementation isn't specific to any particular dataset or access
pattern (i.e. infinite vs. subset).
So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular
similarities with LSH variants for all but
Thanks Michael. I managed to translate the TermInSetQuery into Scala
yesterday so now I can modify it in my codebase. This seems promising so
far. Fingers crossed there's a way to maintain scores without basically
converging to the BooleanQuery implementation.
- AK
On Wed, Jun 24, 2020 at 8:40 AM
On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> I'm working on an Elasticsearch plugin (using Lucene internally) that
> allows users to index numerical vectors and run exact and approximate
> k-nearest-neighbors similarity queries.
Quite a coincidence. I'm looking into the same thing :-)
> 1
Yeah that will require some changes since what it does currently is to
maintain a bitset and OR into it repeatedly (once for each term's
docs). To maintain counts, you'd need a counter per doc (rather than a
bit), and you might lose some of the speed...
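
The trade-off described above can be illustrated with a toy model. This is not Lucene's implementation: a Python int stands in for the bitset, a plain list for the per-doc counters, and the function names and `postings` layout are assumptions made for the example.

```python
def match_bitset(postings, terms):
    """One bit per doc: records whether a doc matched, not how many terms it matched."""
    bits = 0
    for term in terms:
        for doc_id in postings.get(term, ()):
            bits |= 1 << doc_id  # OR the doc into the set, once per term's docs
    return bits


def match_counts(postings, terms, max_doc):
    """One counter per doc: keeps the number of matched terms, at a higher memory cost."""
    counts = [0] * max_doc
    for term in terms:
        for doc_id in postings.get(term, ()):
            counts[doc_id] += 1
    return counts


postings = {"a": [0, 2], "b": [2, 3]}
bits = match_bitset(postings, ["a", "b"])
counts = match_counts(postings, ["a", "b"], max_doc=4)
```

Doc 2 matches both terms, but the bitset records it only once; the counter array is what preserves that information.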
On Tue, Jun 23, 2020 at 8:52 PM Alex K wro
The TermInSetQuery is definitely faster. Unfortunately it doesn't seem to
return the number of terms that matched in a given document. Rather it just
returns the boost value. I'll look into copying/modifying the internals to
return the number of matched terms.
Thanks
- AK
On Tue, Jun 23, 2020 at
Hi Michael,
Thanks for the quick response!
I will look into the TermInSetQuery.
My usage of "heap" might've been confusing.
I'm using a FunctionScoreQuery from Elasticsearch.
This gets instantiated with a Lucene query, in this case the boolean query
as I described it, as well as a custom ScoreFun
You might consider using a TermInSetQuery in place of a BooleanQuery
for the hashes (since they are all in the same field).
I don't really understand why you are seeing so much cost in the heap
- it sounds as if you have a single heap with mixed scores - those
generated by the BooleanQuery and t
Hello all,
I'm working on an Elasticsearch plugin (using Lucene internally) that
allows users to index numerical vectors and run exact and approximate
k-nearest-neighbors similarity queries.
I'd like to get some feedback about my usage of BooleanQueries and
TermQueries, and see if there are any op