Hi all, Marcin Miłkowski wrote:
... >> Correct spelling: déterrer >> >> Hunspell suggestions: >> détérer --> déterrer is not suggested (8th position if line KEY removed) >> détèrer --> déterrer is at 4th position >> détêrer --> déterrer is at 3rd position >> détërer --> déterrer is at 2nd position >> ^^ >> || >> |`-----> one r is missing >> | >> `------> should be e > > In all these cases, two letters must be replaced. In terms of > Levenshtein distance (the standard measure of the difference between > strings), the correct form is actually a "worse" suggestion than other > forms that require a change of only one letter. Of course, the space of > corrections is not as uniform as Levenshtein suggests, so some changes > should be given preference to others. I don't know how to do that > besides changing the TRY line and REPs. > > I'm only saying that it looks as if the Levenshtein distance was used > but there surely is another way to find better suggestions. Anyone? > For a reference: http://en.wikipedia.org/wiki/Levenshtein_distance Befor looking for a completely different solution I'd like to try experimenting with a slight modification of that algorithm Usually the Levenshtein distance is calculated by counting changes with the following weight (as in the link above) (A) - adding a character: +1 - deleting a character: +1 - changing a character: +1 But sometimes I have seen weights that prefer changes over insertions and deletions, in those cases the weights were like this: (B) - adding a character: +2 - deleting a character: +1 - changing a character: +2 Because of this, and since the actual problem is only with getting a better proposal if the character differs only in its 'decoration' I'd like to suggest trying the following idea: shifting the weights. Provided hunspell uses weights like this (the actual values do not matter!) - adding a character: +A - deleting a character: +D - changing a character: +C then the weights should be calculated like this instead - adding a character: +2*A - deleting a character: +2*D - changing a character: +2*C, if the characters differ not just by 'decoration' - changing a character: +C, if the characters differ *only in* the decoration That way changes like é to c will have double weight and changes like e to é will have only single weight. Thus the latter changes should be preferable compared to other changes and therefor the respective suggestions being higher up in the list of proposals. Obviously the suggestion mechanism now needs to accept words of twice the Levenshtein distance than before. Regards, Thomas --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
