Re: Requiring multiple matches of a term

2011-08-22 Thread Simon Willnauer
On Mon, Aug 22, 2011 at 8:10 PM, Chris Hostetter wrote:
>
> : One simple way of doing this is maybe to write a wrapper for TermQuery
> : that only returns docs with a Term Frequency > X. As far as I
> : understand the question, those terms don't have to be within a certain
> : window, right?
>
> I don't think you could do it as a Query Wrapper -- it would have to be a
> Scorer wrapper, correct?

A query wrapper boils down to a scorer. If you don't want to change the
Lucene source, you should simply write your own query wrapper.

simon
>
> That's the approach rmuir and i were discussing on friday, and i just
> posted a patch of the "guts" that could use some review...
>
>        https://issues.apache.org/jira/browse/LUCENE-3395
>
> ..the end goal would be options in TermQuery that would cause it to
> automatically wrap its Scorer in one of these, a la..
>
>        TermQuery q = new TermQuery(new Term("foo","bar"));
>        q.setMinFreq(4.0f);
>        q.setMaxFreq(1000.0f);
>
> ...and in solr, options for this could be added to the {!term} parser...
>
>        q={!term f=foo minTf=4.0 maxTf=1000.0}bar
>
> (could maybe add syntax to the regular query parser, but i think our
> strategic meta-character reserves are dangerously low)
>
>
> -Hoss
>


Re: Requiring multiple matches of a term

2011-08-22 Thread Chris Hostetter

: One simple way of doing this is maybe to write a wrapper for TermQuery
: that only returns docs with a Term Frequency > X. As far as I
: understand the question, those terms don't have to be within a certain
: window, right?

I don't think you could do it as a Query Wrapper -- it would have to be a 
Scorer wrapper, correct?

That's the approach rmuir and i were discussing on friday, and i just 
posted a patch of the "guts" that could use some review...

https://issues.apache.org/jira/browse/LUCENE-3395

..the end goal would be options in TermQuery that would cause it to
automatically wrap its Scorer in one of these, a la..

TermQuery q = new TermQuery(new Term("foo","bar"));
q.setMinFreq(4.0f);
q.setMaxFreq(1000.0f);
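
As an illustration only (this is not the LUCENE-3395 patch, and it uses no
Lucene classes), the wrapping idea can be sketched in plain Java. The
DocFreqIterator interface, both classes, and the inclusive treatment of the
bounds are assumptions made for this sketch; the 4.0 and 1000.0 limits are
taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a scorer: walks matching docs and exposes the term frequency. */
interface DocFreqIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Advances to the next matching doc and returns its id, or NO_MORE_DOCS. */
    int nextDoc();

    /** Term frequency within the current doc. */
    float freq();
}

/** Array-backed iterator, used only to drive the example. */
final class ArrayDocFreqIterator implements DocFreqIterator {
    private final int[] docs;
    private final float[] freqs;
    private int pos = -1;

    ArrayDocFreqIterator(int[] docs, float[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    public int nextDoc() {
        pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    public float freq() {
        return freqs[pos];
    }
}

/** Wrapper that skips docs whose frequency is outside [minFreq, maxFreq] (inclusive by assumption). */
final class FreqRangeIterator implements DocFreqIterator {
    private final DocFreqIterator in;
    private final float minFreq;
    private final float maxFreq;

    FreqRangeIterator(DocFreqIterator in, float minFreq, float maxFreq) {
        this.in = in;
        this.minFreq = minFreq;
        this.maxFreq = maxFreq;
    }

    public int nextDoc() {
        int doc;
        // Keep advancing the wrapped iterator until a doc passes the frequency check.
        while ((doc = in.nextDoc()) != NO_MORE_DOCS) {
            float f = in.freq();
            if (f >= minFreq && f <= maxFreq) {
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }

    public float freq() {
        return in.freq();
    }

    /** Drains an iterator into a list of doc ids. */
    static List<Integer> collect(DocFreqIterator it) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        DocFreqIterator base = new ArrayDocFreqIterator(
                new int[] {0, 3, 7, 9}, new float[] {1f, 4f, 2000f, 6f});
        System.out.println(collect(new FreqRangeIterator(base, 4.0f, 1000.0f)));
    }
}
```

The wrapper leaves the underlying iterator untouched and only filters what it
returns, which is why the same check could sit behind either a query-level or a
scorer-level API.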

...and in solr, options for this could be added to the {!term} parser...

q={!term f=foo minTf=4.0 maxTf=1000.0}bar

(could maybe add syntax to the regular query parser, but i think our 
strategic meta-character reserves are dangerously low)


-Hoss


RE: Requiring multiple matches of a term

2011-08-21 Thread Michael Ryan
> One simple way of doing this is maybe to write a wrapper for TermQuery
> that only returns docs with a Term Frequency > X. As far as I
> understand the question, those terms don't have to be within a certain
> window, right?

Correct. Terms can be anywhere in the document. I figured term frequencies 
might be involved, but wasn't sure how to actually do this.

> Hmmm... i would think the phrase query approach should work, but it's
> totally possible that there's something odd in the way phrase queries
> work that could cause a problem -- the best way to sanity test something
> like this is to try a really small self contained example that you can post
> for other people to try.

I've been able to reduce it pretty far, but I don't have a totally 
self-contained example yet. I haven't tried it out yet on a stock build of Solr 
(I'm using 3.2 with various patches). Right now I'm inserting a few documents 
with a text field that contains "dog dog dog", then repeatedly running q="dog 
dog dog dog"~1 with the queryResultCache disabled. The query is not giving me 
the same results each time (!!!). Sometimes all the documents are returned, 
sometimes a subset is returned, and sometimes no documents are returned.

So far I've traced it down to the "repeats" array in 
SloppyPhraseScorer.initPhrasePositions() - depending on the order of the 
elements in this array, the document may or may not match. I think the 
HashSet.toArray() call is to blame here, but I don't yet fully understand the 
expected behavior of the initPhrasePositions function...
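
To illustrate why HashSet.toArray() could produce a run-dependent order, here
is a small plain-Java sketch. It assumes the set elements do not override
hashCode(), so their bucket placement follows identity hash codes; that is an
assumption for the sake of the example, not something confirmed about
SloppyPhraseScorer. The Pos class and the sort-by-offset workaround are
hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class RepeatOrderDemo {
    /** Hypothetical position tracker; it deliberately does not override equals/hashCode. */
    static final class Pos {
        final int offset;

        Pos(int offset) {
            this.offset = offset;
        }
    }

    /** Copies the set into an array in whatever order the HashSet happens to yield. */
    static Pos[] unordered(Set<Pos> set) {
        return set.toArray(new Pos[0]);
    }

    /** Same contents, but sorted by offset so every run sees the same order. */
    static Pos[] ordered(Set<Pos> set) {
        Pos[] a = set.toArray(new Pos[0]);
        Arrays.sort(a, (x, y) -> Integer.compare(x.offset, y.offset));
        return a;
    }

    /** Extracts the offsets for printing or comparison. */
    static int[] offsets(Pos[] a) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i].offset;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Pos> repeats = new HashSet<>();
        for (int i = 0; i < 4; i++) {
            repeats.add(new Pos(i));
        }
        // The first line may differ between JVM runs; the second should not.
        System.out.println("unordered: " + Arrays.toString(offsets(unordered(repeats))));
        System.out.println("ordered:   " + Arrays.toString(offsets(ordered(repeats))));
    }
}
```

If the matching logic depends on element order, an explicit sort (or an
insertion-ordered collection such as LinkedHashSet) would remove that source of
variation.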

-Michael


Re: Requiring multiple matches of a term

2011-08-21 Thread Simon Willnauer
On Fri, Aug 19, 2011 at 6:26 PM, Michael Ryan wrote:
> Is there a way to specify in a query that a term must match at least X times 
> in a document, where X is some value greater than 1?
>

One simple way of doing this is maybe to write a wrapper for TermQuery
that only returns docs with a Term Frequency > X. As far as I
understand the question, those terms don't have to be within a certain
window, right?
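
A minimal plain-Java sketch of the frequency check itself, with no Lucene
involved: documents are just strings, tokenization is a naive whitespace split,
and the class and method names are made up for illustration. It uses "at least
minCount" to match the original question rather than a strict "> X".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class MinTermFrequency {
    /** Counts whitespace-separated tokens equal to the (lowercase) term. */
    static int termFrequency(String text, String term) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Returns the indexes of docs in which the term occurs at least minCount times, anywhere. */
    static List<Integer> docsWithAtLeast(List<String> docs, String term, int minCount) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (termFrequency(docs.get(i), term) >= minCount) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("dog dog dog");
        docs.add("dog cat");
        docs.add("the dog saw a dog and another dog and a dog");
        System.out.println(docsWithAtLeast(docs, "dog", 3));
    }
}
```

Because only the count matters, the positions of the occurrences play no role,
which is the difference from the proximity-query approach.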

simon
> For example, I want to only get documents that contain the word "dog" three 
> times.  I've thought that using a proximity query with an arbitrarily large 
> distance value might do it:
> "dog dog dog"~10
> And that does seem to return the results I expect.
>
> But when I try for more than three, I start getting unexpected result counts 
> as I change the proximity value:
> "dog dog dog dog"~10 returns 6403 results
> "dog dog dog dog"~20 returns 9291 results
> "dog dog dog dog"~30 returns 6395 results
>
> Anyone ever do something like this and know how I can accomplish this?
>
> -Michael
>


Re: Requiring multiple matches of a term

2011-08-19 Thread Chris Hostetter

FWIW: i think this is a really cool and interesting question.

: Is there a way to specify in a query that a term must match at least X 
: times in a document, where X is some value greater than 1?

at the moment, i think your "phrase query" approach is really the only 
viable way (although it did get me thinking about how hard it would be 
to implement this at a lower level ... i'll see if i can work out a patch)

: But when I try for more than three, I start getting unexpected result 
: counts as I change the proximity value:

Hmmm... i would think the phrase query approach should work, but it's 
totally possible that there's something odd in the way phrase queries work 
that could cause a problem -- the best way to sanity test something like 
this is to try a really small self contained example that you can post for 
other people to try.

If you said "2 clauses work, but not 3" i would guess that maybe there is 
a "terms out of order" type issue involved, but "3 works, not 4" smells 
fishy.

-Hoss


Requiring multiple matches of a term

2011-08-19 Thread Michael Ryan
Is there a way to specify in a query that a term must match at least X times in 
a document, where X is some value greater than 1?

For example, I want to only get documents that contain the word "dog" three 
times.  I've thought that using a proximity query with an arbitrarily large 
distance value might do it:
"dog dog dog"~10
And that does seem to return the results I expect.

But when I try for more than three, I start getting unexpected result counts as 
I change the proximity value:
"dog dog dog dog"~10 returns 6403 results
"dog dog dog dog"~20 returns 9291 results
"dog dog dog dog"~30 returns 6395 results

Anyone ever do something like this and know how I can accomplish this?

-Michael