TermsQuery works by pulling the postings lists for each term and OR-ing them
together to create a bitset, which is very memory-efficient but means that you
don't know at doc collection time which term has actually matched.
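A minimal sketch of that behavior in plain Python (illustrative only, not Lucene code): each term's postings list is OR-ed into one shared set of doc IDs, and after the merge the set only records *that* a document matched, not *which* terms hit it.

```python
# Toy postings lists: term -> sorted doc IDs (assumed example data).
postings = {
    "apple":  [0, 3, 7],
    "banana": [3, 5],
    "cherry": [7, 9],
}

matched = set()           # stands in for the bitset
for term, docs in postings.items():
    matched.update(docs)  # OR this term's postings into the bitset

# Term identity is gone: doc 3 matched, but nothing says it was
# "apple" and "banana" that matched it.
print(sorted(matched))  # [0, 3, 5, 7, 9]
```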
For your case you probably want to create a SpanOrQuery, and then
Or, a really simple-minded approach: just use the frequency
as a ratio of numFound to estimate terms.
Doesn't work of course if you need precise counts.
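One way to read the ratio idea above (my interpretation, with made-up names): if the OR query matched numFound documents, then by linearity of expectation the *average* number of query terms matching each document is roughly sum(df) / numFound, where df is each term's document frequency. Wrong for any individual document, but very cheap.

```python
def estimate_avg_matched_terms(dfs, num_found):
    """Rough average of query terms matched per result doc.

    dfs: document frequency of each query term.
    num_found: number of docs the OR query matched.
    """
    if num_found == 0:
        return 0.0
    return sum(dfs) / num_found

# 3 terms with dfs 10, 20, 30 spread over 40 matched docs:
# ~1.5 terms per matching doc on average.
print(estimate_avg_matched_terms([10, 20, 30], 40))  # 1.5
```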
On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull wrote:
How precise do you need to be?
I wonder if you could efficiently approximate "number of matches" by
getting the document frequency of each term. I realize this is an
approximation, but the highest document frequency would be your floor.
Let's say you have terms t1, t2, and t3 ... tn. t1 has
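The bounds described above can be sketched like this (function names are mine, not from the list): for an OR over terms t1..tn, the true number of matching documents is at least the largest single document frequency, and at most the smaller of sum(df) and the corpus size.

```python
def match_count_bounds(dfs, corpus_size):
    """Floor and ceiling on how many docs an OR over the terms can match.

    dfs: document frequency of each query term.
    corpus_size: total number of docs in the index.
    """
    floor = max(dfs) if dfs else 0       # every doc of the commonest term matches
    ceiling = min(sum(dfs), corpus_size)  # can't exceed sum of dfs or the corpus
    return floor, ceiling

# Terms with dfs 5, 12, 7 in a 100-doc index: between 12 and 24 docs match.
print(match_count_bounds([5, 12, 7], corpus_size=100))  # (12, 24)
```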
Let's say we're trying to do document to document matching (not with
MLT). We have a shingling analysis chain. The query is a document, which
is itself shingled. We then look up those shingles in the index. The %
of shingles found is in some sense a marker as to the extent to which
the documents match.
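A minimal sketch of that shingle-overlap measure in plain Python (no Lucene; the n=2 word shingles are an assumption for illustration): shingle both documents the same way, then take the fraction of the query document's shingles found in the other document.

```python
def shingles(text, n=2):
    """Set of n-word shingles from whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shingle_overlap(query_doc, indexed_doc, n=2):
    """Fraction of the query doc's shingles that appear in the indexed doc."""
    q = shingles(query_doc, n)
    if not q:
        return 0.0
    return len(q & shingles(indexed_doc, n)) / len(q)

# Only "the quick" is shared out of the query's 3 shingles -> 1/3.
print(shingle_overlap("the quick brown fox", "the quick red fox"))
```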
I have a scenario where I want to search for documents that contain many
terms (maybe 100s or 1000s), and then know the number of terms that
matched. I'm happy to implement this as a query object/parser.
I understand that Lucene isn't well suited to this scenario. Any
suggestions as to how to