Hi Greg,
I created http://issues.apache.org/jira/browse/LUCENE-3120 for this problem,
and attached there a more general test that exposes this problem, based on
your test case.
I am not sure yet that this is indeed a problem to be fixed with regard to
span queries (see more there in JIRA) but at
Doron
We let our users decide whether they want to force the order or not, so
in effect they pass in "inOrder".
I would have to detect a repeated term and change the parameter as a
result in order to work around this. I'd rather not do that,
though.
Thanks
Greg
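
For reference, the workaround described above could look roughly like the sketch below. The class and method names are hypothetical and nothing here is part of Lucene. It also assumes the adjustment is to force "inOrder" to true when a term repeats; the message does not say which way the parameter would change.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class InOrderWorkaround {
    private InOrderWorkaround() {}

    /** Returns true if any term text occurs more than once in the list. */
    static boolean hasRepeatedTerm(List<String> terms) {
        Set<String> seen = new HashSet<>();
        for (String term : terms) {
            // Set.add returns false when the element was already present.
            if (!seen.add(term)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Keeps the user's choice unless a term repeats, in which case ordered
     * matching is forced (assumed direction of the workaround).
     */
    static boolean effectiveInOrder(List<String> terms, boolean userInOrder) {
        return userInOrder || hasRepeatedTerm(terms);
    }
}
```

The returned flag would then be passed as the third argument of new SpanNearQuery(clauses, slop, inOrder) in place of the user-supplied value.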
-Original Message-
Hi Greg,
On Thu, May 19, 2011 at 12:26 PM, Gregory Tarr wrote:
> We let our users decide whether they want to force the order or not, so
> in effect they pass in "inOrder".
>
> I would have to detect a repeated term and change the parameter as a
> result of that in order to work around this - I'd r
I believe Lucene already does this, with the 'coord' factor in
BooleanQuery, which is on by default (i.e., if you just "new
BooleanQuery()").
I.e., your doc c will get a coord factor of 1.0, doc b gets 0.666..., doc
a gets 0.333...
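
As a rough illustration of the arithmetic, the coord factor in DefaultSimilarity is overlap / maxOverlap: the number of query clauses that matched, divided by the total number of clauses. The sketch below is plain Java with no Lucene dependency. The class name, the whitespace tokenization, and the three-term query (nacho, foo, bar) are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; mirrors the coord arithmetic, not Lucene's code path. */
final class CoordSketch {
    private CoordSketch() {}

    /** Same formula as DefaultSimilarity.coord: matched clauses over total clauses. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Counts how many distinct query terms occur at least once in the document. */
    static int overlap(List<String> queryTerms, String docText) {
        // Assumption: simple whitespace tokenization stands in for a real analyzer.
        Set<String> docTerms = new HashSet<>(Arrays.asList(docText.split("\\s+")));
        int matched = 0;
        for (String term : new HashSet<>(queryTerms)) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("nacho", "foo", "bar");
        for (String doc : new String[] {"nacho foo bar", "foo bar bar", "nacho nacho nacho nacho"}) {
            System.out.println(doc + " -> coord=" + coord(overlap(query, doc), query.size()));
        }
    }
}
```

Repeating a single term raises the term frequency but not the overlap, so under these assumptions a document that contains only nacho stays at 1/3 however many times the term occurs.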
That said, if the term freq is high enough (ie doc a has nacho 4
times)
A little test shows that Mike is correct, and Lucene does already do this.
With norms (default)
nacho foo bar, score=0.8660254
foo bar bar, score=0.46461558
nacho nacho nacho nacho, score=0.19245009
Without norms
nacho foo bar, score=1.7320508
foo bar bar, score=0.92923117
nacho nacho nacho
Of course, IDF is a factor too, meaning a match on a single rare (to the overall
index) term may be worth more than a match on 2 different common (to the index)
terms.
As Ian suggests, a custom Similarity implementation can be used to tune this out.
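
To make the IDF point concrete, here is a small sketch of the idf formula used by DefaultSimilarity, log(numDocs / (docFreq + 1)) + 1, in plain Java. The class name and the index statistics (1000 documents, a term in 1 document versus a term in 500) are invented for illustration. The comparison looks at idf alone and ignores tf, norms and coord.

```java
/** Illustrative sketch only; the index statistics below are made up. */
final class IdfSketch {
    private IdfSketch() {}

    /** Same formula as DefaultSimilarity.idf. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 1000;                   // assumed index size
        double rare = idf(1, numDocs);        // term found in 1 document
        double common = idf(500, numDocs);    // term found in 500 documents
        System.out.println("rare idf       = " + rare);
        System.out.println("2 x common idf = " + (2 * common));
    }
}
```

With these assumed numbers, one rare term has a larger idf than two common terms combined. A custom Similarity that overrides idf to return a constant would remove that effect.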
- Original Message
From: Ian Lea
To: j
Thanks Paul,
I do not know what duplicates are in this case, and it is the denominator of
the TF that bothers me more than the numerator (if that is in fact
what you are suggesting).
What have been the effects of ignoring the IDF? When is it appropriate? It
would seem that by doing so ra
Hi Rich,
If I understand correctly, you are concerned that short documents are
preferred too much over long ones. Is this really the case?
To understand what goes on, it would help to look at the Explanation of the
score for, say, two result documents - one that you think is ranked too low,
and one tha