Re: Queries for many terms

2015-11-03 Thread Alan Woodward
TermsQuery works by pulling the postings lists for each term and OR-ing them together to create a bitset, which is very memory-efficient but means that you don't know at doc collection time which term has actually matched. For your case you probably want to create a SpanOrQuery, and then
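To make the trade-off above concrete, here is a minimal sketch in plain Python, not Lucene code. The toy `postings` dictionary and both helper functions are hypothetical names made up for illustration. It shows that OR-ing postings into a single bitset records only which documents matched, while a per-document tally keeps the number of matching terms at the cost of more memory.

```python
from collections import Counter

# Toy inverted index (hypothetical data): term -> list of doc ids,
# standing in for real postings lists.
postings = {
    "apple": [0, 2, 5],
    "banana": [2, 3],
    "cherry": [5],
}

def or_into_bitset(terms):
    """OR the postings of every term into one bitset (a Python int here)."""
    bits = 0
    for term in terms:
        for doc_id in postings.get(term, []):
            bits |= 1 << doc_id
    return bits

def count_matching_terms(terms):
    """Tally, per document, how many distinct query terms it contains."""
    counts = Counter()
    for term in set(terms):
        for doc_id in postings.get(term, []):
            counts[doc_id] += 1
    return counts

query = ["apple", "banana", "cherry"]

# The bitset only says "this doc matched something".
bits = or_into_bitset(query)
matched_docs = [d for d in range(6) if (bits >> d) & 1]

# The tally keeps the per-document match count.
counts = count_matching_terms(query)
```

In this toy index, documents 2 and 5 each contain two of the three query terms, which the bitset alone cannot tell you.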

Re: Queries for many terms

2015-11-02 Thread Erick Erickson
Or a really simple-minded approach: just use the frequency as a ratio of numFound to estimate terms. Of course, this doesn't work if you need precise counts. On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull wrote: > How precise do you need to be? > > I wonder if

Re: Queries for many terms

2015-11-02 Thread Doug Turnbull
How precise do you need to be? I wonder if you could efficiently approximate "number of matches" by getting the document frequency of each term. I realize this is an approximation, but the highest document frequency would be your floor. Let's say you have terms t1, t2, and t3 ... tn. t1 has
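The message is cut off, so the following is only one reading of the idea: if you know each term's document frequency, you can bound the number of documents that match at least one term. The largest frequency is a floor and the sum of the frequencies (capped at the index size) is a ceiling. The function name and the sample numbers are hypothetical. In Lucene the frequencies could come from something like `IndexReader.docFreq(Term)`, but this sketch just takes them as a list.

```python
def match_count_bounds(doc_freqs, num_docs):
    """Return (floor, ceiling) for the number of docs matching any term.

    doc_freqs: document frequency of each query term (assumed known).
    num_docs:  total number of documents in the index.
    """
    if not doc_freqs:
        return (0, 0)
    # Every doc containing the most frequent term matches, so that is the floor.
    lower = max(doc_freqs)
    # If no two terms ever co-occur the counts simply add up,
    # but never beyond the size of the index.
    upper = min(sum(doc_freqs), num_docs)
    return (lower, upper)

# Hypothetical frequencies for three terms in a 1000-document index.
bounds = match_count_bounds([40, 25, 10], num_docs=1000)
```

The true count lies somewhere between the two values; how close it is to either end depends on how much the terms overlap, which document frequencies alone do not reveal.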

Re: Queries for many terms

2015-11-02 Thread Upayavira
Let's say we're trying to do document to document matching (not with MLT). We have a shingling analysis chain. The query is a document, which is itself shingled. We then look up those shingles in the index. The % of shingles found is in some sense a marker as to the extent to which the documents
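As a rough illustration of the measure described above, here is a small plain-Python sketch. It assumes whitespace tokenization and word bigrams as shingles, which is far simpler than a real analysis chain, and the function names are hypothetical. It computes the fraction of the query document's shingles that also occur in an indexed document.

```python
def shingles(text, size=2):
    """Return the set of word n-grams (shingles) of the given size."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    }

def shingle_overlap(query_text, indexed_text, size=2):
    """Fraction of the query's shingles that also appear in the indexed text."""
    query_shingles = shingles(query_text, size)
    if not query_shingles:
        return 0.0
    found = query_shingles & shingles(indexed_text, size)
    return len(found) / len(query_shingles)

# Two of the three query bigrams ("quick brown", "brown fox") are found.
score = shingle_overlap("the quick brown fox", "a quick brown fox jumps")
```

Against a real index the lookup would go through the term dictionary rather than a Python set, but the ratio being computed is the same.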

Queries for many terms

2015-11-02 Thread Upayavira
I have a scenario where I want to search for documents that contain many terms (maybe 100s or 1000s), and then know the number of terms that matched. I'm happy to implement this as a query object/parser. I understand that Lucene isn't well suited to this scenario. Any suggestions as to how to