Re: PhraseQuery and edit distance slightly confusing.

Doug Cutting Wed, 15 Mar 2006 11:22:42 -0800

Dawid Weiss wrote:

I get the concept implemented in PhraseQuery but isn't calling it an edit distance a little bit far-fetched?


Yes, it should probably be called "edit-distance-like" or something.

Only the marginal elements (minimum and maximum distance from their respective query positions) are taken into account. Consider this example:

phrase:     a  b  c  d
term pos:   0  1  2  3

document A: a  c  b  d
term pos:   0  1  2  3
pos. diff:  0 -1  1  0

=> slope = (1 - (-1)) = 2

document B: a  c  b  x  d
term pos:   0  1  2  3  4
pos. diff:  0 -1  1  -  1

=> slope = (1 - (-1)) = 2

It's how it is currently implemented, isn't it?


That's correct.
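
The measure described in the example can be sketched as follows. This is an illustration only, not Lucene's code: the class and method names are hypothetical, and positions are indexed by phrase term (a, b, c, d) rather than by document order.

```java
/** Illustrative sketch only; not taken from Lucene. */
class PhraseSlopSketch {
    /**
     * Returns max(diff) - min(diff), where diff is the document position
     * minus the query position of each matched phrase term.
     * Terms that are not part of the phrase (like "x") are simply absent.
     */
    static int slope(int[] queryPos, int[] docPos) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < queryPos.length; i++) {
            int diff = docPos[i] - queryPos[i];
            min = Math.min(min, diff);
            max = Math.max(max, diff);
        }
        return max - min;
    }
}
```

For the phrase "a b c d" (query positions 0, 1, 2, 3), document A places the terms at 0, 2, 1, 3 and document B at 0, 2, 1, 4. Both give a slope of 2, which is why the two phrases are scored identically.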

The scoring difference (attached example) is different just because "document" lengths are different; phrases themselves are scored identically even though I believe B should be penalized. A simple way to do it would be to include phrase length divided by the matching span length...

We could do this by adding more parameters to Similarity.sloppyFreq(). For example, the signature could become:


public float sloppyFreq(int distance, int matchLength);

But what then would the criteria for matching at all be? Right now it is "distance <= slop", but, with this change, shouldn't it also take into account the match length?
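
One possible shape for such a two-argument method is sketched below. Everything here is an assumption made for illustration: the class is standalone rather than a Similarity subclass, the phrase length is passed in through a hypothetical constructor because the proposed signature does not carry it, and the 1 / (distance + 1) base term is, as far as I know, what the default similarity uses today. It leaves the matching criteria untouched.

```java
/** Hypothetical sketch; not part of Lucene's Similarity API. */
class SpanAwareSloppyFreqSketch {
    private final int phraseLength; // number of terms in the phrase (assumed known)

    SpanAwareSloppyFreqSketch(int phraseLength) {
        this.phraseLength = phraseLength;
    }

    /**
     * distance: the existing slop measure (max diff - min diff).
     * matchLength: number of document positions spanned by the match.
     */
    float sloppyFreq(int distance, int matchLength) {
        float base = 1.0f / (distance + 1);
        // Penalize matches that are stretched over more positions than the phrase has terms.
        float compactness = (float) phraseLength / matchLength;
        return base * compactness;
    }
}
```

With a four-term phrase, document A (distance 2, match length 4) keeps its current value of 1/3, while document B (distance 2, match length 5) is scaled by 4/5 and so scores lower.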

but I'm guessing it's implemented like that for a reason, just didn't know what that reason might be ;)

No particularly good reason. I was looking for a simple single measure that incorporated both out-of-order and insertion. You argue that, when both are present, the penalty should be higher, which makes good sense. Right now the penalty is like the maximum error, but the sum of errors in the match might be better.

To implement this, we could sum the absolute values of your "pos. diff" values. That would give an error value of 2 for document A and 3 for document B, as you desire. This would have the benefit of not changing the signature of sloppyFreq() and would also still provide a clear criteria for matching: totalError < slop. The biggest downside is that it would not be back-compatible: documents which used to match with slop=2 would no longer. I don't think this is a huge problem, but it does warrant providing an option to restore the old behaviour.
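
A minimal sketch of the summed-error measure, using the same example data. The class and method names are hypothetical, and the sketch assumes positions are measured relative to the start of the match, as in the example; unlike max minus min, a sum of absolute values changes if every position is shifted by a constant.

```java
/** Illustrative sketch only; not taken from Lucene. */
class TotalErrorSketch {
    /** Sums |document position - query position| over all matched phrase terms. */
    static int totalError(int[] queryPos, int[] docPos) {
        int sum = 0;
        for (int i = 0; i < queryPos.length; i++) {
            sum += Math.abs(docPos[i] - queryPos[i]);
        }
        return sum;
    }
}
```

For the phrase "a b c d" at query positions 0, 1, 2, 3, document A (terms at 0, 2, 1, 3) gives 0 + 1 + 1 + 0 = 2 and document B (terms at 0, 2, 1, 4) gives 0 + 1 + 1 + 1 = 3.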

There might be a slight performance impact. I think we could implement this by simply decrementing and incrementing the totalError each time SloppyPhraseScorer calls nextPosition() or firstPosition(). Does that sound right to you?
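
The incremental bookkeeping could look roughly like this. It is a standalone sketch, not SloppyPhraseScorer: the class, its fields, and the moveTerm() method are hypothetical stand-ins for whatever would happen when a term advances to a new position.

```java
/** Hypothetical sketch of incremental totalError maintenance; not Lucene code. */
class IncrementalTotalErrorSketch {
    private final int[] queryPos;
    private final int[] docPos;
    private int totalError;

    IncrementalTotalErrorSketch(int[] queryPos, int[] initialDocPos) {
        this.queryPos = queryPos.clone();
        this.docPos = initialDocPos.clone();
        for (int i = 0; i < queryPos.length; i++) {
            totalError += Math.abs(docPos[i] - queryPos[i]);
        }
    }

    /** Called when one phrase term moves to a new document position. */
    void moveTerm(int term, int newDocPos) {
        // Remove the term's old contribution, then add the new one.
        totalError -= Math.abs(docPos[term] - queryPos[term]);
        docPos[term] = newDocPos;
        totalError += Math.abs(newDocPos - queryPos[term]);
    }

    int getTotalError() {
        return totalError;
    }
}
```

Each move costs a constant amount of work, so the running total avoids re-summing all terms on every position change.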


Cheers,

Doug

