Re: Queries for many terms

Erick Erickson Mon, 02 Nov 2015 10:48:11 -0800

Or a really simple--minded approach, just use the frequency
as a ration of numFound to estimate terms.


Doesn't work of course if you need precise counts.

On Mon, Nov 2, 2015 at 9:50 AM, Doug Turnbull
<dturnb...@opensourceconnections.com> wrote:
> How precise do you need to be?
>
> I wonder if you could efficiently approximate "number of matches" by
> getting the document frequency of each term. I realize this is an
> approximation, but the highest document frequency would be your floor.
>
> Let's say you have terms t1, t2, and t3 ... tn. t1 has highest doc freq, tn
> lowest.
>
> OK the following algorithm could refine your floor
> - count = t1.docfreq
> - Then issue a query for NOT t1, this eliminates many candidate documents
> to improve performance
> - Build a bloom filter or other set-membership data structure for t2...tn
> https://en.wikipedia.org/wiki/Bloom_filter
> - In a PostFilter(?) Lucene Collector(?) scan each collected/returned
> document and do a set membership test against the bloom filter. If member,
> then increment your count.
>
> It's O(numDocs that don't match t1)
>
> This is me just thinking outloud, but maybe it'll trigger thoughts in
> others...
> -Doug
>
>
> On Mon, Nov 2, 2015 at 12:14 PM, Upayavira <u...@odoko.co.uk> wrote:
>
>> I have a scenario where I want to search for documents that contain many
>> terms (maybe 100s or 1000s), and then know the number of terms that
>> matched. I'm happy to implement this as a query object/parser.
>>
>> I understand that Lucene isn't well suited to this scenario. Any
>> suggestions as to how to make this more efficient? Does the TermsQuery
>> work differently from the BooleanQuery regarding large numbers of terms?
>>
>> Upayavira
>>
>
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> <http://opensourceconnections.com>, LLC | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.

Re: Queries for many terms

Reply via email to