Re: Queries for many terms

2015-11-03 Thread Alan Woodward
TermsQuery works by pulling the postings lists for each term and OR-ing them together to create a bitset, which is very memory-efficient but means that you don't know at doc collection time which term has actually matched. For your case you probably want to create a SpanOrQuery, and then iterate
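To illustrate the point about the bitset, here is a toy model in plain Python. It is not Lucene code: postings lists are modeled as lists of doc ids, the bitset as a Python int, and the names `postings`, `or_postings` and `docs_in` are made up for this sketch. Once the lists are OR-ed together, only "this doc matched something" survives.

```python
# Toy model only, not Lucene code. Postings lists are plain lists of doc ids.
postings = {
    "apple": [0, 2, 5],
    "banana": [2, 3],
    "cherry": [5, 7],
}

def or_postings(terms, postings):
    """OR the postings of all terms into a single bitset (a Python int)."""
    bits = 0
    for term in terms:
        for doc_id in postings.get(term, []):
            bits |= 1 << doc_id
    return bits

def docs_in(bits):
    """List the doc ids whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

matched = or_postings(["apple", "banana", "cherry"], postings)
print(docs_in(matched))  # doc 2 is present, but not which terms put it there
```

The bitset costs one bit per document regardless of how many terms are in the query, which is where the memory efficiency comes from, and also why the per-term information is gone by collection time.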

Re: Queries for many terms

2015-11-02 Thread Upayavira
Let's say we're trying to do document to document matching (not with MLT). We have a shingling analysis chain. The query is a document, which is itself shingled. We then look up those shingles in the index. The % of shingles found is in some sense a marker as to the extent to which the documents ar
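The "% of shingles found" measure can be sketched in plain Python, outside of any analysis chain. The assumptions here are mine, not the poster's: shingles are word bigrams, tokenization is a lowercase whitespace split, and `shingles` and `shingle_overlap` are hypothetical helper names.

```python
def shingles(text, size=2):
    """Return the set of word shingles (word n-grams) of the given size."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def shingle_overlap(query_doc, indexed_doc, size=2):
    """Fraction of the query document's shingles also found in the indexed document."""
    query = shingles(query_doc, size)
    if not query:
        return 0.0
    return len(query & shingles(indexed_doc, size)) / len(query)

# Two of the three query shingles ("the quick", "quick brown") are found.
ratio = shingle_overlap("the quick brown fox", "the quick brown dog")
print(ratio)
```

Note that the measure is asymmetric: it is taken relative to the query document's shingles, so swapping the two arguments can give a different value.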

Re: Queries for many terms

2015-11-02 Thread Erick Erickson
Or a really simple-minded approach: just use the frequency as a ratio of numFound to estimate terms. Doesn't work, of course, if you need precise counts. On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull wrote: > How precise do you need to be? > > I wonder if you could efficiently approximate "numbe

Re: Queries for many terms

2015-11-02 Thread Doug Turnbull
How precise do you need to be? I wonder if you could efficiently approximate "number of matches" by getting the document frequency of each term. I realize this is an approximation, but the highest document frequency would be your floor. Let's say you have terms t1, t2, and t3 ... tn. t1 has highe
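A rough sketch of the document-frequency bound, reading "number of matches" as the number of documents that match at least one term. The floor follows the message above; the ceiling (sum of the frequencies, capped at the index size) is an addition of mine, and `match_count_bounds` and the sample frequencies are hypothetical.

```python
def match_count_bounds(doc_freqs, num_docs):
    """Bounds on how many docs match at least one term, from document frequencies alone."""
    if not doc_freqs:
        return 0, 0
    # Every doc containing the most frequent term matches, so that is the floor.
    floor = max(doc_freqs.values())
    # If no two terms ever co-occur, the frequencies simply add up (assumed ceiling).
    ceiling = min(sum(doc_freqs.values()), num_docs)
    return floor, ceiling

bounds = match_count_bounds({"t1": 40, "t2": 25, "t3": 10}, num_docs=60)
print(bounds)
```

The gap between the two bounds grows with the number of terms, so this is only useful when a coarse estimate is acceptable.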

Queries for many terms

2015-11-02 Thread Upayavira
I have a scenario where I want to search for documents that contain many terms (maybe 100s or 1000s), and then know the number of terms that matched. I'm happy to implement this as a query object/parser. I understand that Lucene isn't well suited to this scenario. Any suggestions as to how to make
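For reference, the result being asked for can be stated as a brute-force sketch in plain Python. This is not a Lucene query, just a small model of the desired output: the index is a dict from term to a list of doc ids, and `count_matching_terms` is a hypothetical name.

```python
from collections import Counter

def count_matching_terms(query_terms, postings):
    """Map doc id -> number of distinct query terms that occur in that doc."""
    counts = Counter()
    for term in set(query_terms):
        # set() guards against a doc id appearing twice in one postings list.
        for doc_id in set(postings.get(term, [])):
            counts[doc_id] += 1
    return counts

postings = {"apple": [0, 2, 5], "banana": [2, 3], "cherry": [5, 7]}
counts = count_matching_terms(["apple", "banana", "cherry", "durian"], postings)
print(counts.most_common())
```

The work is proportional to the total length of the postings lists touched, which is what makes hundreds or thousands of query terms expensive.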