Re: Requiring multiple matches of a term

2011-08-22 Thread Chris Hostetter

: One simple way of doing this is maybe to write a wrapper for TermQuery
: that only returns docs with a Term Frequency > X. As far as I
: understand the question, those terms don't have to be within a certain
: window, right?

I don't think you could do it as a Query Wrapper -- it would have to be a 
Scorer wrapper, correct?

That's the approach rmuir and i were discussing on friday, and i just 
posted a patch of the guts that could use some review...

https://issues.apache.org/jira/browse/LUCENE-3395

..the end goal would be options in TermQuery that would cause it to 
automatically wrap its Scorer in one of these, a la..

TermQuery q = new TermQuery(new Term("foo", "bar"));
q.setMinFreq(4.0f);
q.setMaxFreq(1000.0f);

...and in solr, options for this could be added to the {!term} parser...

q={!term f=foo minTf=4.0 maxTf=1000.0}bar

(could maybe add syntax to the regular query parser, but i think our 
strategic meta-character reserves are dangerously low)
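
To make the "Scorer wrapper" idea concrete, here is a minimal, self-contained sketch of a frequency-range filter around a postings iterator. It is not the LUCENE-3395 patch and does not use Lucene's Scorer API: TermDocs, ArrayTermDocs, FreqRangeTermDocs and MinFreqSketch are hypothetical names invented for illustration, and frequencies are plain ints rather than the floats used in the proposed setters above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-term postings iterator (NOT Lucene's Scorer). */
interface TermDocs {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc(); // advance; return the next doc id, or NO_MORE_DOCS when exhausted
    int freq();    // term frequency within the current doc
}

/** Demo postings backed by parallel arrays of doc ids and frequencies. */
final class ArrayTermDocs implements TermDocs {
    private final int[] docs;
    private final int[] freqs;
    private int pos = -1;

    ArrayTermDocs(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    public int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    public int freq() {
        return freqs[pos];
    }
}

/** Wrapper that skips docs whose term frequency falls outside [minFreq, maxFreq]. */
final class FreqRangeTermDocs implements TermDocs {
    private final TermDocs in;
    private final int minFreq;
    private final int maxFreq;

    FreqRangeTermDocs(TermDocs in, int minFreq, int maxFreq) {
        this.in = in;
        this.minFreq = minFreq;
        this.maxFreq = maxFreq;
    }

    public int nextDoc() {
        int doc;
        // Keep advancing the wrapped iterator until a doc passes the frequency check.
        while ((doc = in.nextDoc()) != NO_MORE_DOCS) {
            int f = in.freq();
            if (f >= minFreq && f <= maxFreq) {
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }

    public int freq() {
        return in.freq();
    }
}

final class MinFreqSketch {
    /** Drains an iterator into a list of matching doc ids. */
    static List<Integer> collect(TermDocs td) {
        List<Integer> hits = new ArrayList<>();
        for (int d = td.nextDoc(); d != TermDocs.NO_MORE_DOCS; d = td.nextDoc()) {
            hits.add(d);
        }
        return hits;
    }

    public static void main(String[] args) {
        // "dog" occurs 1, 3, 5 and 2 times in docs 0, 3, 7 and 9 respectively.
        TermDocs dog = new ArrayTermDocs(new int[] {0, 3, 7, 9}, new int[] {1, 3, 5, 2});
        System.out.println(collect(new FreqRangeTermDocs(dog, 3, Integer.MAX_VALUE))); // [3, 7]
    }
}
```

The filter needs only the doc id and the per-doc frequency, so no position data is read. That is presumably the appeal of doing this at the scorer level rather than through a sloppy phrase query.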


-Hoss


Re: Requiring multiple matches of a term

2011-08-22 Thread Simon Willnauer
On Mon, Aug 22, 2011 at 8:10 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

> : One simple way of doing this is maybe to write a wrapper for TermQuery
> : that only returns docs with a Term Frequency > X. As far as I
> : understand the question, those terms don't have to be within a certain
> : window, right?
>
> I don't think you could do it as a Query Wrapper -- it would have to be a
> Scorer wrapper, correct?

A query wrapper boils down to a scorer. If you don't want to change the
Lucene source, you should simply write your own query wrapper.

simon

> That's the approach rmuir and i were discussing on friday, and i just
> posted a patch of the guts that could use some review...
>
>        https://issues.apache.org/jira/browse/LUCENE-3395
>
> ..the end goal would be options in TermQuery that would cause it to
> automatically wrap its Scorer in one of these, a la..
>
>        TermQuery q = new TermQuery(new Term("foo", "bar"));
>        q.setMinFreq(4.0f);
>        q.setMaxFreq(1000.0f);
>
> ...and in solr, options for this could be added to the {!term} parser...
>
>        q={!term f=foo minTf=4.0 maxTf=1000.0}bar
>
> (could maybe add syntax to the regular query parser, but i think our
> strategic meta-character reserves are dangerously low)
>
>
> -Hoss



Re: Requiring multiple matches of a term

2011-08-21 Thread Simon Willnauer
On Fri, Aug 19, 2011 at 6:26 PM, Michael Ryan mr...@moreover.com wrote:
> Is there a way to specify in a query that a term must match at least X times
> in a document, where X is some value greater than 1?


One simple way of doing this is maybe to write a wrapper for TermQuery
that only returns docs with a Term Frequency > X. As far as I
understand the question, those terms don't have to be within a certain
window, right?

simon
> For example, I want to only get documents that contain the word "dog" three
> times.  I've thought that using a proximity query with an arbitrarily large
> distance value might do it:
> "dog dog dog"~10
> And that does seem to return the results I expect.
>
> But when I try for more than three, I start getting unexpected result counts
> as I change the proximity value:
> "dog dog dog dog"~10 returns 6403 results
> "dog dog dog dog"~20 returns 9291 results
> "dog dog dog dog"~30 returns 6395 results
>
> Anyone ever do something like this and know how I can accomplish this?
>
> -Michael



RE: Requiring multiple matches of a term

2011-08-21 Thread Michael Ryan
> One simple way of doing this is maybe to write a wrapper for TermQuery
> that only returns docs with a Term Frequency > X. As far as I
> understand the question, those terms don't have to be within a certain
> window, right?

Correct. Terms can be anywhere in the document. I figured term frequencies 
might be involved, but wasn't sure how to actually do this.

> Hmmm... i would think the phrase query approach should work, but it's
> totally possible that there's something odd in the way phrase queries
> work that could cause a problem -- the best way to sanity test something
> like this is to try a really small self contained example that you can post
> for other people to try.

I've been able to reduce it pretty far, but I don't have a totally 
self-contained example yet. I haven't tried it out yet on a stock build of Solr 
(I'm using 3.2 with various patches). Right now I'm inserting a few documents 
with a text field that contains "dog dog dog", then repeatedly running
q="dog dog dog dog"~1 with the queryResultCache disabled. The query is not giving me 
the same results each time (!!!). Sometimes all the documents are returned, 
sometimes a subset is returned, and sometimes no documents are returned.

So far I've traced it down to the repeats array in 
SloppyPhraseScorer.initPhrasePositions() - depending on the order of the 
elements in this array, the document may or may not match. I think the 
HashSet.toArray() call is to blame here, but I don't yet fully understand the 
expected behavior of the initPhrasePositions function...
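
The ordering concern can be illustrated in isolation. The sketch below assumes, without having checked the Lucene source, that the set holds objects that do not override hashCode(), so HashSet iteration order follows identity hash codes and can differ from run to run. Pos and RepeatsOrderSketch are hypothetical names and not Lucene classes. Sorting the array by an explicit key is one way to get a reproducible order. It is not necessarily the right fix for SloppyPhraseScorer.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a phrase-position object; hashCode() is NOT overridden. */
final class Pos {
    final int offset;

    Pos(int offset) {
        this.offset = offset;
    }
}

final class RepeatsOrderSketch {
    /** Offsets in whatever order HashSet.toArray() happens to produce. */
    static int[] unorderedOffsets(Set<Pos> set) {
        Pos[] arr = set.toArray(new Pos[0]);
        int[] out = new int[arr.length];
        for (int i = 0; i < arr.length; i++) {
            out[i] = arr[i].offset;
        }
        return out;
    }

    /** Same elements, sorted by offset so the order is reproducible. */
    static int[] stableOffsets(Set<Pos> set) {
        Pos[] arr = set.toArray(new Pos[0]);
        Arrays.sort(arr, Comparator.comparingInt((Pos p) -> p.offset));
        int[] out = new int[arr.length];
        for (int i = 0; i < arr.length; i++) {
            out[i] = arr[i].offset;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Pos> repeats = new HashSet<>();
        for (int offset : new int[] {2, 0, 3, 1}) {
            repeats.add(new Pos(offset));
        }
        // The first line may change between JVM runs; the second should not.
        System.out.println(Arrays.toString(unorderedOffsets(repeats)));
        System.out.println(Arrays.toString(stableOffsets(repeats)));
    }
}
```

If matching really depends on the order of that array, an order-dependent result from a set with unspecified iteration order would be consistent with the non-deterministic behaviour described above.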

-Michael


Re: Requiring multiple matches of a term

2011-08-19 Thread Chris Hostetter

FWIW: i think this is a really cool and interesting question.

: Is there a way to specify in a query that a term must match at least X 
: times in a document, where X is some value greater than 1?

at the moment, i think your phrase query approach is really the only 
viable way (although it did get me thinking about how hard it would be 
to implement this at a lower level ... i'll see if i can work out a patch)

: But when I try for more than three, I start getting unexpected result 
: counts as I change the proximity value:

Hmmm... i would think the phrase query approach should work, but it's 
totally possible that there's something odd in the way phrase queries work 
that could cause a problem -- the best way to sanity test something like 
this is to try a really small self contained example that you can post for 
other people to try.

If you said 2 clauses work, but not 3, i would guess that maybe there is 
a "terms out of order" type issue involved, but "3 works, not 4" smells 
fishy.

-Hoss