Re: Lucene handling of duplicate terms

Kristofer Karlsson Thu, 05 Sep 2013 00:58:13 -0700

On Thu, Sep 5, 2013 at 9:46 AM, Adrien Grand <[email protected]> wrote:


> Hi,
>
> On Thu, Sep 5, 2013 at 9:28 AM, Kristofer Karlsson <[email protected]> wrote:
> > I have a use case where some of my documents have duplicate terms in
> > various fields or within the same field.
> >
> > For example, I may have a million documents with just the term "foo" in
> > field A, and one particular document with the term "foo" in both field A
> > and B, or with two "foo" terms in the same field.
> >
> > If I search for "foo foo" I would like to filter out all the documents
> > with only one matching term - is this possible?
>
> I don't think we have existing queries that allow for doing it
> efficiently (if someone reads this and knows it is wrong, please
> correct!). However, it should be doable to implement such a query
> rather easily by iterating over the postings lists of the 'foo' term
> in all the fields you are interested in, summing up frequencies (the
> index must have been created with IndexOptions.DOCS_AND_FREQS or
> higher) and only keeping documents whose sum of frequencies is at
> least 2.
>
> --
> Adrien
>
Thanks for the quick reply!

So I'd have to manually count each term after tokenizing the search query
and keep a map of term to count. I will definitely try this.
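
To make the idea concrete, here is a rough sketch of the two steps in plain
Java: counting the terms of a tokenized query, then summing per-document
frequencies across fields and keeping only documents that reach the required
count. It does not use the Lucene API. The class name, the method names and
the docId -> frequency maps are hypothetical stand-ins for the postings lists
that a real custom query would iterate over.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: models the approach with plain collections.
final class DuplicateTermFilter {

    // Count how often each term occurs in an already tokenized query,
    // e.g. ["foo", "foo"] gives {foo=2}.
    static Map<String, Integer> countQueryTerms(List<String> queryTokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : queryTokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // postingsPerField holds one map per field (assumed layout):
    // docId -> frequency of the term in that field.
    // Returns the doc ids, in ascending order, whose summed frequency
    // over all fields is at least minTotalFreq.
    static List<Integer> docsWithMinTotalFreq(
            List<Map<Integer, Integer>> postingsPerField, int minTotalFreq) {
        Map<Integer, Integer> totalFreq = new TreeMap<>();
        for (Map<Integer, Integer> postings : postingsPerField) {
            for (Map.Entry<Integer, Integer> entry : postings.entrySet()) {
                totalFreq.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        List<Integer> matches = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : totalFreq.entrySet()) {
            if (entry.getValue() >= minTotalFreq) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}
```

For the query "foo foo", countQueryTerms gives a required count of 2 for
"foo", and that count is passed as minTotalFreq. In an actual Lucene query the
frequencies would come from the index rather than from in-memory maps, so the
index would need to be built with IndexOptions.DOCS_AND_FREQS or higher, as
Adrien notes.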

