On Thu, Sep 5, 2013 at 9:46 AM, Adrien Grand <jpou...@gmail.com> wrote:
> Hi,
>
> On Thu, Sep 5, 2013 at 9:28 AM, Kristofer Karlsson <k...@spotify.com>
> wrote:
> > I have a use case where some of my documents have duplicate terms in
> > various fields or within the same field.
> >
> > For example, I may have a million documents with just the term "foo" in
> > field A, and one particular document with the term "foo" in both field A
> > and B, or with two occurrences of "foo" in the same field.
> >
> > If I search for "foo foo" I would like to filter out all the documents
> > with only one matching term - is this possible?
>
> I don't think we have existing queries that allow for doing it
> efficiently (if someone reads this and knows it is wrong, please
> correct!). However, it should be doable to implement such a query
> rather easily by iterating over the postings lists of the 'foo' term
> in all the fields you are interested in, summing up frequencies (the
> index must have been created with IndexOptions.DOCS_AND_FREQS or
> higher) and only keeping documents whose sum of frequencies is at
> least 2.
>
> --
> Adrien

Thanks for the quick reply!

So I'd have to manually count each term after tokenizing the search query
and keep a map of term to count. I will definitely try this.
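To make the idea concrete, here is a minimal sketch of the counting logic discussed above: count each term in the tokenized query, sum that term's frequencies per document across the fields of interest, and keep only documents that reach the required count. It deliberately does not use the Lucene API. The postings lists are simulated with plain in-memory maps (field -> term -> docId -> frequency), and the class and method names (DuplicateTermFilter, requiredCounts, summedFreqs, matchingDocs) are hypothetical, chosen only for illustration. In a real index the frequencies would come from the postings of each field, which requires IndexOptions.DOCS_AND_FREQS or higher.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper; postings are simulated as field -> term -> (docId -> freq).
final class DuplicateTermFilter {

    // Counts how often each term occurs in the tokenized query,
    // e.g. ["foo", "foo"] becomes {foo=2}.
    static Map<String, Integer> requiredCounts(List<String> queryTerms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : queryTerms) {
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    // Sums the frequency of one term per document across the given fields.
    static Map<Integer, Integer> summedFreqs(
            Map<String, Map<String, Map<Integer, Integer>>> postings,
            List<String> fields,
            String term) {
        Map<Integer, Integer> sums = new HashMap<>();
        for (String field : fields) {
            Map<String, Map<Integer, Integer>> byTerm =
                    postings.getOrDefault(field, Collections.emptyMap());
            Map<Integer, Integer> docFreqs =
                    byTerm.getOrDefault(term, Collections.emptyMap());
            docFreqs.forEach((doc, freq) -> sums.merge(doc, freq, Integer::sum));
        }
        return sums;
    }

    // Returns the doc IDs (ascending) whose summed frequency reaches the
    // required count for every distinct query term.
    static List<Integer> matchingDocs(
            Map<String, Map<String, Map<Integer, Integer>>> postings,
            List<String> fields,
            List<String> queryTerms) {
        TreeSet<Integer> result = null;
        for (Map.Entry<String, Integer> required : requiredCounts(queryTerms).entrySet()) {
            TreeSet<Integer> hits = new TreeSet<>();
            summedFreqs(postings, fields, required.getKey()).forEach((doc, sum) -> {
                if (sum >= required.getValue()) {
                    hits.add(doc);
                }
            });
            if (result == null) {
                result = hits;          // first term: start from its hits
            } else {
                result.retainAll(hits); // later terms: intersect
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }
}
```

With one document holding "foo" once in field A, another holding it once in each of A and B, and a third holding it twice in A, a query tokenized to ["foo", "foo"] over fields A and B keeps only the second and third documents. Whether this maps cleanly onto a custom Lucene query depends on the Lucene version in use, so treat it as an outline of the logic rather than a drop-in implementation.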