Re: Requiring multiple matches of a term
On Mon, Aug 22, 2011 at 8:10 PM, Chris Hostetter wrote:
>
> : One simple way of doing this is maybe to write a wrapper for TermQuery
> : that only returns docs with a Term Frequency > X. As far as I
> : understand the question, those terms don't have to be within a certain
> : window, right?
>
> I don't think you could do it as a Query Wrapper -- it would have to be a
> Scorer wrapper, correct?

A query wrapper boils down to a scorer. If you don't want to change the Lucene source, you should simply write your own query wrapper.

simon

>
> That's the approach rmuir and i were discussing on friday, and i just
> posted a patch of the "guts" that could use some review...
>
> https://issues.apache.org/jira/browse/LUCENE-3395
>
> ..the end goal would be options in TermQuery that would cause it to
> automatically wrap its Scorer in one of these, ala..
>
>   TermQuery q = new TermQuery(new Term("foo","bar"));
>   q.setMinFreq(4.0f);
>   q.setMaxFreq(1000.0f);
>
> ...and in solr, options for this could be added to the {!term} parser...
>
>   q={!term f=foo minTf=4.0 maxTf=1000.0}bar
>
> (could maybe add syntax to the regular query parser, but i think our
> strategic meta-character reserves are dangerously low)
>
>
> -Hoss
>
Re: Requiring multiple matches of a term
: One simple way of doing this is maybe to write a wrapper for TermQuery
: that only returns docs with a Term Frequency > X. As far as I
: understand the question, those terms don't have to be within a certain
: window, right?

I don't think you could do it as a Query Wrapper -- it would have to be a Scorer wrapper, correct?

That's the approach rmuir and i were discussing on friday, and i just posted a patch of the "guts" that could use some review...

https://issues.apache.org/jira/browse/LUCENE-3395

..the end goal would be options in TermQuery that would cause it to automatically wrap its Scorer in one of these, ala..

  TermQuery q = new TermQuery(new Term("foo","bar"));
  q.setMinFreq(4.0f);
  q.setMaxFreq(1000.0f);

...and in solr, options for this could be added to the {!term} parser...

  q={!term f=foo minTf=4.0 maxTf=1000.0}bar

(could maybe add syntax to the regular query parser, but i think our strategic meta-character reserves are dangerously low)

-Hoss
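For readers following the thread, the "Scorer wrapper" idea above can be sketched without any Lucene dependency. Everything here is an assumption made for illustration: FreqIterator, ArrayFreqIterator, FreqRangeFilter and MinFreqDemo are hypothetical names, not Lucene classes and not the LUCENE-3395 patch. Frequencies are plain ints for simplicity, whereas the proposed setMinFreq/setMaxFreq options above take floats.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-term document iterator; NOT Lucene's Scorer API. */
interface FreqIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Advances to the next matching document and returns its id, or NO_MORE_DOCS. */
    int nextDoc();

    /** Term frequency in the current document; only valid after nextDoc() returned a real id. */
    int freq();
}

/** Array-backed iterator used only to feed the demo. */
final class ArrayFreqIterator implements FreqIterator {
    private final int[] docs;
    private final int[] freqs;
    private int pos = -1;

    ArrayFreqIterator(int[] docs, int[] freqs) {
        if (docs.length != freqs.length) {
            throw new IllegalArgumentException("docs and freqs must have the same length");
        }
        this.docs = docs;
        this.freqs = freqs;
    }

    @Override public int nextDoc() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    @Override public int freq() {
        return freqs[pos];
    }
}

/** Wraps another iterator and skips documents whose frequency is outside [minFreq, maxFreq]. */
final class FreqRangeFilter implements FreqIterator {
    private final FreqIterator in;
    private final int minFreq;
    private final int maxFreq;

    FreqRangeFilter(FreqIterator in, int minFreq, int maxFreq) {
        this.in = in;
        this.minFreq = minFreq;
        this.maxFreq = maxFreq;
    }

    @Override public int nextDoc() {
        int doc;
        while ((doc = in.nextDoc()) != NO_MORE_DOCS) {
            int f = in.freq();
            if (f >= minFreq && f <= maxFreq) {
                return doc; // frequency is in range, so this document is a hit
            }
        }
        return NO_MORE_DOCS;
    }

    @Override public int freq() {
        return in.freq();
    }
}

final class MinFreqDemo {
    /** Drains an iterator into a list of document ids. */
    static List<Integer> collect(FreqIterator it) {
        List<Integer> out = new ArrayList<>();
        for (int d = it.nextDoc(); d != FreqIterator.NO_MORE_DOCS; d = it.nextDoc()) {
            out.add(d);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up postings for "dog": docs 0, 3, 7, 9 with frequencies 1, 3, 5, 2.
        FreqIterator dog = new ArrayFreqIterator(new int[] {0, 3, 7, 9}, new int[] {1, 3, 5, 2});
        // Keep only documents where "dog" occurs at least three times.
        System.out.println(collect(new FreqRangeFilter(dog, 3, Integer.MAX_VALUE))); // prints [3, 7]
    }
}
```

The filter never looks at positions, only at the per-document frequency, which is why this approach would not need a proximity window at all.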
RE: Requiring multiple matches of a term
> One simple way of doing this is maybe to write a wrapper for TermQuery
> that only returns docs with a Term Frequency > X as far as I
> understand the question those terms don't have to be within a certain
> window right?

Correct. Terms can be anywhere in the document. I figured term frequencies might be involved, but wasn't sure how to actually do this.

> Hmmm... i would think the phrase query approach should work, but it's
> totally possible that there's something odd in the way phrase queries
> work that could cause a problem -- the best way to sanity test something
> like this is to try a really small self contained example that you can post
> for other people to try.

I've been able to reduce it pretty far, but I don't have a totally self-contained example yet. I haven't tried it out yet on a stock build of Solr (I'm using 3.2 with various patches).

Right now I'm inserting a few documents with a text field that contains "dog dog dog", then repeatedly running q="dog dog dog dog"~1 with the queryResultCache disabled. The query is not giving me the same results each time (!!!). Sometimes all the documents are returned, sometimes a subset is returned, and sometimes no documents are returned.

So far I've traced it down to the "repeats" array in SloppyPhraseScorer.initPhrasePositions() - depending on the order of the elements in this array, the document may or may not match. I think the HashSet.toArray() call is to blame here, but I don't yet fully understand the expected behavior of the initPhrasePositions function...

-Michael
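The general Java behavior behind that suspicion can be shown in isolation. This is only a sketch of how HashSet ordering works, not a reproduction of SloppyPhraseScorer: the Position class and RepeatOrderDemo are hypothetical and are not Lucene's PhrasePositions. The assumption illustrated is that the set elements do not override hashCode(), so they hash by identity and HashSet.toArray() may return them in a different order from one JVM run to the next, while a LinkedHashSet always returns insertion order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

final class RepeatOrderDemo {
    /** Hypothetical element type; no equals/hashCode override, so it hashes by identity. */
    static final class Position {
        final int offset;

        Position(int offset) {
            this.offset = offset;
        }
    }

    /** Copies the set into an array via toArray() and returns the offsets in that order. */
    static int[] offsets(Set<Position> set) {
        Position[] arr = set.toArray(new Position[0]);
        int[] out = new int[arr.length];
        for (int i = 0; i < arr.length; i++) {
            out[i] = arr[i].offset;
        }
        return out;
    }

    public static void main(String[] args) {
        Position[] input = {new Position(0), new Position(1), new Position(2), new Position(3)};

        // HashSet makes no ordering guarantee; with identity hash codes the
        // order can differ between runs of the same program.
        Set<Position> hashed = new HashSet<>(Arrays.asList(input));
        System.out.println("HashSet:       " + Arrays.toString(offsets(hashed)));

        // LinkedHashSet iterates in insertion order on every run.
        Set<Position> linked = new LinkedHashSet<>(Arrays.asList(input));
        System.out.println("LinkedHashSet: " + Arrays.toString(offsets(linked)));
    }
}
```

If an algorithm's result depends on the order of such an array, switching to an ordered collection or sorting the array before use is one way to make it deterministic; whether that is the right fix for initPhrasePositions is a separate question.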
Re: Requiring multiple matches of a term
On Fri, Aug 19, 2011 at 6:26 PM, Michael Ryan wrote:
> Is there a way to specify in a query that a term must match at least X times
> in a document, where X is some value greater than 1?
>

One simple way of doing this is maybe to write a wrapper for TermQuery that only returns docs with a Term Frequency > X. As far as I understand the question, those terms don't have to be within a certain window, right?

simon

> For example, I want to only get documents that contain the word "dog" three
> times. I've thought that using a proximity query with an arbitrarily large
> distance value might do it:
> "dog dog dog"~10
> And that does seem to return the results I expect.
>
> But when I try for more than three, I start getting unexpected result counts
> as I change the proximity value:
> "dog dog dog dog"~10 returns 6403 results
> "dog dog dog dog"~20 returns 9291 results
> "dog dog dog dog"~30 returns 6395 results
>
> Anyone ever do something like this and know how I can accomplish this?
>
> -Michael
>
Re: Requiring multiple matches of a term
FWIW: i think this is a really cool and interesting question.

: Is there a way to specify in a query that a term must match at least X
: times in a document, where X is some value greater than 1?

at the moment, i think your "phrase query" approach is really the only viable way (although it did get me thinking about how hard it would be to implement this at a lower level ... i'll see if i can work out a patch)

: But when I try for more than three, I start getting unexpected result
: counts as I change the proximity value:

Hmmm... i would think the phrase query approach should work, but it's totally possible that there's something odd in the way phrase queries work that could cause a problem -- the best way to sanity test something like this is to try a really small self contained example that you can post for other people to try.

If you said "2 clauses work, but not 3" i would guess that maybe there is a "terms out of order" type issue involved, but "3 works not 4" smells fishy.

-Hoss
Requiring multiple matches of a term
Is there a way to specify in a query that a term must match at least X times in a document, where X is some value greater than 1?

For example, I want to only get documents that contain the word "dog" three times. I've thought that using a proximity query with an arbitrarily large distance value might do it:
"dog dog dog"~10
And that does seem to return the results I expect.

But when I try for more than three, I start getting unexpected result counts as I change the proximity value:
"dog dog dog dog"~10 returns 6403 results
"dog dog dog dog"~20 returns 9291 results
"dog dog dog dog"~30 returns 6395 results

Anyone ever do something like this and know how I can accomplish this?

-Michael