Re: Requiring multiple matches of a term
: One simple way of doing this is maybe to write a wrapper for TermQuery
: that only returns docs with a Term Frequency >= X. As far as I
: understand the question, those terms don't have to be within a certain
: window, right?

I don't think you could do it as a Query wrapper -- it would have to be a Scorer wrapper, correct?

That's the approach rmuir and i were discussing on friday, and i just posted a patch of the guts that could use some review...

https://issues.apache.org/jira/browse/LUCENE-3395

...the end goal would be options in TermQuery that would cause it to automatically wrap its Scorer in one of these, a la...

  TermQuery q = new TermQuery(new Term("foo", "bar"));
  q.setMinFreq(4.0f);
  q.setMaxFreq(1000.0f);

...and in solr, options for this could be added to the {!term} parser...

  q={!term f=foo minTf=4.0 maxTf=1000.0}bar

(could maybe add syntax to the regular query parser, but i think our strategic meta-character reserves are dangerously low)

-Hoss
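A minimal sketch of the Scorer-wrapper idea described above, written against a stand-in interface rather than the real Lucene API. The names Postings, ArrayPostings, FreqFilteredPostings and MinFreqDemo are hypothetical and exist only for illustration; the actual patch in LUCENE-3395 may look quite different. The wrapper walks the underlying postings and skips any doc whose term frequency falls outside the requested range.

```java
import java.util.ArrayList;
import java.util.List;

/** Demo driver (hypothetical): collects the doc IDs a postings iterator produces. */
final class MinFreqDemo {
    static List<Integer> collect(Postings p) {
        List<Integer> hits = new ArrayList<>();
        for (int d = p.nextDoc(); d != Postings.NO_MORE_DOCS; d = p.nextDoc()) {
            hits.add(d);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Docs 0, 3, 7, 9 contain the term 1, 3, 5 and 2 times respectively.
        Postings dog = new ArrayPostings(new int[] {0, 3, 7, 9}, new int[] {1, 3, 5, 2});
        // Keep only docs where the term occurs at least 3 times.
        System.out.println(collect(new FreqFilteredPostings(dog, 3, 1000))); // prints [3, 7]
    }
}

/** Stand-in for a postings iterator; NOT the Lucene interface. */
interface Postings {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Advances to the next doc and returns its ID, or NO_MORE_DOCS when exhausted. */
    int nextDoc();

    /** Term frequency within the current doc. */
    int freq();
}

/** Postings backed by parallel arrays of doc IDs and frequencies. */
final class ArrayPostings implements Postings {
    private final int[] docs;
    private final int[] freqs;
    private int pos = -1;

    ArrayPostings(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    @Override public int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    @Override public int freq() {
        return freqs[pos];
    }
}

/** Wrapper that only surfaces docs whose frequency lies in [minFreq, maxFreq]. */
final class FreqFilteredPostings implements Postings {
    private final Postings in;
    private final int minFreq;
    private final int maxFreq;

    FreqFilteredPostings(Postings in, int minFreq, int maxFreq) {
        this.in = in;
        this.minFreq = minFreq;
        this.maxFreq = maxFreq;
    }

    @Override public int nextDoc() {
        int doc;
        while ((doc = in.nextDoc()) != NO_MORE_DOCS) {
            int f = in.freq();
            if (f >= minFreq && f <= maxFreq) {
                return doc; // frequency is in range, so this doc matches
            }
        }
        return NO_MORE_DOCS;
    }

    @Override public int freq() {
        return in.freq();
    }
}
```

Filtering at the iterator level means the term positions never have to be compared, which is why this avoids the sloppy-phrase machinery entirely.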
Re: Requiring multiple matches of a term
On Mon, Aug 22, 2011 at 8:10 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> : One simple way of doing this is maybe to write a wrapper for TermQuery
> : that only returns docs with a Term Frequency >= X. As far as I
> : understand the question, those terms don't have to be within a certain
> : window, right?
>
> I don't think you could do it as a Query wrapper -- it would have to be a Scorer wrapper, correct?

A query wrapper boils down to a scorer. If you don't want to change the Lucene source, you should simply write your own query wrapper.

simon

> That's the approach rmuir and i were discussing on friday, and i just posted a patch of the guts that could use some review...
>
> https://issues.apache.org/jira/browse/LUCENE-3395
>
> ...the end goal would be options in TermQuery that would cause it to automatically wrap its Scorer in one of these, a la...
>
>   TermQuery q = new TermQuery(new Term("foo", "bar"));
>   q.setMinFreq(4.0f);
>   q.setMaxFreq(1000.0f);
>
> ...and in solr, options for this could be added to the {!term} parser...
>
>   q={!term f=foo minTf=4.0 maxTf=1000.0}bar
>
> (could maybe add syntax to the regular query parser, but i think our strategic meta-character reserves are dangerously low)
>
> -Hoss
Re: Requiring multiple matches of a term
On Fri, Aug 19, 2011 at 6:26 PM, Michael Ryan mr...@moreover.com wrote:
> Is there a way to specify in a query that a term must match at least X times in a document, where X is some value greater than 1?

One simple way of doing this is maybe to write a wrapper for TermQuery that only returns docs with a Term Frequency >= X. As far as I understand the question, those terms don't have to be within a certain window, right?

simon

> For example, I want to only get documents that contain the word "dog" three times. I've thought that using a proximity query with an arbitrarily large distance value might do it:
>
>   "dog dog dog"~10
>
> And that does seem to return the results I expect. But when I try for more than three, I start getting unexpected result counts as I change the proximity value:
>
>   "dog dog dog dog"~10 returns 6403 results
>   "dog dog dog dog"~20 returns 9291 results
>   "dog dog dog dog"~30 returns 6395 results
>
> Anyone ever do something like this and know how I can accomplish this?
>
> -Michael
RE: Requiring multiple matches of a term
> One simple way of doing this is maybe to write a wrapper for TermQuery that only returns docs with a Term Frequency >= X. As far as I understand the question, those terms don't have to be within a certain window, right?

Correct. Terms can be anywhere in the document. I figured term frequencies might be involved, but wasn't sure how to actually do this.

> Hmmm... i would think the phrase query approach should work, but it's totally possible that there's something odd in the way phrase queries work that could cause a problem -- the best way to sanity test something like this is to try a really small self-contained example that you can post for other people to try.

I've been able to reduce it pretty far, but I don't have a totally self-contained example yet. I haven't tried it out yet on a stock build of Solr (I'm using 3.2 with various patches). Right now I'm inserting a few documents with a text field that contains "dog dog dog", then repeatedly running q="dog dog dog dog"~1 with the queryResultCache disabled. The query is not giving me the same results each time (!!!). Sometimes all the documents are returned, sometimes a subset is returned, and sometimes no documents are returned.

So far I've traced it down to the repeats array in SloppyPhraseScorer.initPhrasePositions() - depending on the order of the elements in this array, the document may or may not match. I think the HashSet.toArray() call is to blame here, but I don't yet fully understand the expected behavior of the initPhrasePositions function...

-Michael
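The suspicion about HashSet.toArray() can be illustrated in isolation. The sketch below is not the SloppyPhraseScorer code; Slot is a hypothetical stand-in for an object that, like many internal holder classes, does not override hashCode(). For such objects a HashSet orders its elements by identity hash code, so the array order is unspecified and may change between runs or JVMs, whereas a LinkedHashSet always yields insertion order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SetOrderDemo {
    /** Hypothetical holder object; it relies on the default identity hashCode(). */
    static final class Slot {
        final int offset;

        Slot(int offset) {
            this.offset = offset;
        }
    }

    /** Copies the set into an array and reports the offsets in array order. */
    static List<Integer> offsetsInArrayOrder(Set<Slot> set) {
        List<Integer> out = new ArrayList<>();
        for (Slot s : set.toArray(new Slot[0])) {
            out.add(s.offset);
        }
        return out;
    }

    /** Inserts four slots with offsets 0..3, in that order. */
    static Set<Slot> fill(Set<Slot> set) {
        for (int i = 0; i < 4; i++) {
            set.add(new Slot(i));
        }
        return set;
    }

    public static void main(String[] args) {
        // Order depends on identity hash codes: unspecified, may vary between runs.
        System.out.println("HashSet:       " + offsetsInArrayOrder(fill(new HashSet<>())));
        // Order follows insertion, so the offsets come back as 0, 1, 2, 3.
        System.out.println("LinkedHashSet: " + offsetsInArrayOrder(fill(new LinkedHashSet<>())));
    }
}
```

If an algorithm's outcome depends on the order of such an array, sorting it explicitly or using an insertion-ordered collection is one way to make the result repeatable.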
Re: Requiring multiple matches of a term
FWIW: i think this is a really cool and interesting question.

: Is there a way to specify in a query that a term must match at least X
: times in a document, where X is some value greater than 1?

at the moment, i think your phrase query approach is really the only viable way (although it did get me thinking about how hard it would be to implement this at a lower level ... i'll see if i can work out a patch)

: But when I try for more than three, I start getting unexpected result
: counts as I change the proximity value:

Hmmm... i would think the phrase query approach should work, but it's totally possible that there's something odd in the way phrase queries work that could cause a problem -- the best way to sanity test something like this is to try a really small self-contained example that you can post for other people to try.

If you said 2 clauses work, but not 3, i would guess that maybe there is a "terms out of order" type issue involved, but "3 works, not 4" smells fishy.

-Hoss