Dawid Weiss wrote:
I get the concept implemented in PhraseQuery, but isn't calling it an edit distance a little far-fetched?

Yes, it should probably be called "edit-distance-like" or something.

Only the marginal elements (minimum and maximum distance from their respective query positions) are taken into account. Consider this example:

phrase:     a  b  c  d
term pos:   0  1  2  3

document A: a  c  b  d
term pos:   0  1  2  3
pos. diff:  0 -1  1  0

=> slop = (1 - (-1)) = 2

document B: a  c  b  x  d
term pos:   0  1  2  3  4
pos. diff:  0 -1  1  -  1

=> slop = (1 - (-1)) = 2

It's how it is currently implemented, isn't it?

That's correct.
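
To make the measure in the example concrete, here is a minimal sketch of it, not the actual PhraseQuery code. The class and method names are hypothetical, and it assumes the match is aligned so that the phrase starts at document position 0, as in both documents above.

```java
// Hypothetical illustration of the "marginal" measure: only the smallest and
// largest position differences contribute to the distance.
final class MarginalDistance {
    // queryPos[i] is the position of term i in the phrase;
    // docPos[i] is the position of that same term in the document.
    static int distance(int[] queryPos, int[] docPos) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < queryPos.length; i++) {
            int diff = docPos[i] - queryPos[i];
            min = Math.min(min, diff);
            max = Math.max(max, diff);
        }
        return max - min;
    }
}
```

With the phrase positions {0, 1, 2, 3}, document A gives term positions {0, 2, 1, 3} (for a, b, c, d) and document B gives {0, 2, 1, 4}. Both come out at 2, which is why the extra "x" in B goes unnoticed.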

The scores (attached example) differ only because the "document" lengths differ; the phrases themselves are scored identically, even though I believe B should be penalized. A simple way to do that would be to include the phrase length divided by the matching span length...

We could do this by adding more parameters to Similarity.sloppyFreq(). For example, the signature could become:

public float sloppyFreq(int distance, int matchLength);

But what then would the criteria for matching at all be? Right now it is "distance <= slop", but, with this change, shouldn't it also take into account the match length?
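
As a rough sketch of what such a two-argument method might compute: the class below is hypothetical and standalone (it does not extend Similarity), and it assumes the phrase length is available from somewhere, here a constructor argument. The 1/(distance + 1) factor mirrors what I believe the default sloppyFreq() does today; the length factor is the "phrase length divided by matching span length" idea quoted above.

```java
// Hypothetical sketch only: one possible way to combine distance and match length.
final class LengthAwareSloppyFreq {
    // Number of terms in the phrase (an assumed extra input, not part of the
    // proposed signature).
    private final int phraseLength;

    LengthAwareSloppyFreq(int phraseLength) {
        this.phraseLength = phraseLength;
    }

    // distance: the sloppiness of the match; matchLength: number of document
    // positions spanned by the match.
    float sloppyFreq(int distance, int matchLength) {
        float distanceFactor = 1.0f / (distance + 1);
        float lengthFactor = (float) phraseLength / matchLength;
        return distanceFactor * lengthFactor;
    }
}
```

For the example, document A spans 4 positions and document B spans 5, both with distance 2, so B would score 4/5 of what A scores. This only changes the score; it says nothing yet about whether a match is accepted.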

but I'm guessing it's implemented like that for a reason, just didn't know what that reason might be ;)

No particularly good reason. I was looking for a simple single measure that incorporated both out-of-order and insertion. You argue that, when both are present, the penalty should be higher, which makes good sense. Right now the penalty is like the maximum error, but the sum of errors in the match might be better.

To implement this, we could sum the absolute values of your "pos. diff" values. That would give an error value of 2 for document A and 3 for document B, as you desire. This would have the benefit of not changing the signature of sloppyFreq() and would also still provide a clear criterion for matching: totalError <= slop. The biggest downside is that it would not be back-compatible: documents which used to match with slop=2 would no longer match. I don't think this is a huge problem, but it does warrant providing an option to restore the old behaviour.
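
The arithmetic for the two example documents can be checked with a small sketch. The class and method names are hypothetical, and as in the example the match is assumed to be aligned so that the phrase starts at document position 0.

```java
// Hypothetical illustration of the proposed measure: the sum of the absolute
// position differences over all phrase terms.
final class TotalError {
    // queryPos[i] is the position of term i in the phrase;
    // docPos[i] is the position of that same term in the document.
    static int totalError(int[] queryPos, int[] docPos) {
        int sum = 0;
        for (int i = 0; i < queryPos.length; i++) {
            sum += Math.abs(docPos[i] - queryPos[i]);
        }
        return sum;
    }
}
```

Document A (term positions {0, 2, 1, 3}) sums to 0 + 1 + 1 + 0 = 2, and document B ({0, 2, 1, 4}) sums to 0 + 1 + 1 + 1 = 3.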

There might be a slight performance impact. I think we could implement this by simply decrementing and incrementing the totalError each time SloppyPhraseScorer calls nextPosition() or firstPosition(). Does that sound right to you?
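
Roughly, the bookkeeping could look like the sketch below. It is a standalone, hypothetical class and not SloppyPhraseScorer itself: whenever one term moves to a new position, its old contribution is subtracted and its new one added, so the sum is never recomputed from scratch. Positions are again assumed to be relative to a fixed alignment of the phrase.

```java
// Hypothetical sketch of maintaining the total error incrementally.
final class TotalErrorTracker {
    private final int[] queryPos;
    private final int[] docPos;
    private int totalError;

    TotalErrorTracker(int[] queryPos, int[] initialDocPos) {
        this.queryPos = queryPos.clone();
        this.docPos = initialDocPos.clone();
        // One full pass to establish the starting value.
        for (int i = 0; i < this.queryPos.length; i++) {
            totalError += Math.abs(this.docPos[i] - this.queryPos[i]);
        }
    }

    // Called when the given term moves to a new document position.
    void advance(int term, int newDocPos) {
        totalError -= Math.abs(docPos[term] - queryPos[term]); // remove old contribution
        docPos[term] = newDocPos;
        totalError += Math.abs(newDocPos - queryPos[term]);    // add new contribution
    }

    int totalError() {
        return totalError;
    }
}
```

Starting from document A's positions and then moving "d" from position 3 to 4 takes the total from 2 to 3, at the cost of one subtraction and one addition per position change.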

Cheers,

Doug
