Thomas Lange - Sun Germany - ham02 - Hamburg wrote:

Because of this, and since the actual problem is only with getting a
better proposal when a character differs only in its 'decoration', I'd
like to suggest trying the following idea: shifting the weights.

Provided Hunspell uses weights like this (the actual values do not matter!)
  - adding a character:   +A
  - deleting a character: +D
  - changing a character: +C
then the weights should be calculated like this instead
  - adding a character:   +2*A
  - deleting a character: +2*D
  - changing a character: +2*C, if the characters differ by more than
    their 'decoration'
  - changing a character: +C, if the characters differ *only in* their
    decoration

That way changes like é to c will have double weight and changes like e
to é will have only single weight. Thus the latter changes should be
preferred over other changes, and therefore the respective suggestions
should appear higher up in the list of proposals.
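
The weighting scheme above could be sketched roughly as follows. This is
only an illustration, not Hunspell's actual code: the function names
(strip_decoration, weighted_distance) are made up for this example, and
"decoration" is assumed here to mean Unicode combining marks, which is
narrower than what a MAP line can express.

```python
import unicodedata


def strip_decoration(ch):
    """Return ch without combining marks (assumption: decoration = diacritics)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def weighted_distance(a, b, A=1, D=1, C=1):
    """Edit distance with the shifted weights proposed above.

    Adding costs 2*A, deleting costs 2*D, changing costs 2*C, except that
    a change between characters differing only in decoration costs C.
    """
    # Work on precomposed characters so that "é" is a single code point.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = [j * 2 * A for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * 2 * D]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                change = 0
            elif strip_decoration(ca) == strip_decoration(cb):
                change = C          # differ only in decoration
            else:
                change = 2 * C      # genuinely different characters
            cur.append(min(prev[j] + 2 * D,        # delete ca
                           cur[j - 1] + 2 * A,     # add cb
                           prev[j - 1] + change))  # change ca into cb
        prev = cur
    return prev[-1]
```

With the default values, weighted_distance("e", "é") gives 1 and
weighted_distance("é", "c") gives 2, which matches the ordering described
above.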

I don't know if it's feasible, but we could also give adding or removing a letter a weight of only 1 when a REP line says so.

Examples:
REP r rr
REP rr r

r --> rr     weight:1
rr --> r     weight:1
r --> f      weight:2 or more
fr --> f     weight:2 or more
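
To make the idea concrete, here is a small sketch of a distance that gives weight 1 to any edit listed in a REP line and weight 2 to every other edit. It is an assumption of mine about how this could work, not a description of what Hunspell does; rep_distance and its parameters are hypothetical names, and REP pairs are assumed to be non-empty strings.

```python
def rep_distance(wrong, right, rep, rep_weight=1, other_weight=2):
    """Edit distance where edits listed in REP pairs are cheaper.

    rep is a list of (pattern, replacement) pairs, as in "REP r rr".
    """
    n, m = len(wrong), len(right)
    inf = float("inf")
    # d[i][j] = cheapest way to turn wrong[:i] into right[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if cur == inf:
                continue
            if i < n and j < m:
                # keep the character (free) or change it
                step = 0 if wrong[i] == right[j] else other_weight
                d[i + 1][j + 1] = min(d[i + 1][j + 1], cur + step)
            if i < n:
                # remove a character from the misspelled word
                d[i + 1][j] = min(d[i + 1][j], cur + other_weight)
            if j < m:
                # add a character of the correct word
                d[i][j + 1] = min(d[i][j + 1], cur + other_weight)
            for src, dst in rep:
                # apply a REP rule at this position
                if wrong.startswith(src, i) and right.startswith(dst, j):
                    ti, tj = i + len(src), j + len(dst)
                    d[ti][tj] = min(d[ti][tj], cur + rep_weight)
    return d[n][m]
```

With rep = [("r", "rr"), ("rr", "r")], this gives 1 for r --> rr and for rr --> r, and 2 for r --> f and for fr --> f, as in the examples above.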

Actually, what I want to know is why the MAP lines (which usually describe which letters are similar, with diacritics or "decorations") and the REP lines (which usually describe common replacements) seem to be ignored when the distance goes beyond 1.

IMHO, these lines offer a better way to find the correct spelling than simply calculating the Levenshtein distance, however we calculate it.

I have a suggestion:
Maybe the spellchecker's suggestions would improve if it first tried to apply the MAP and REP rules, without calculating anything, and only fell back to the Levenshtein distance if that found nothing.

Example:
_gommer_ and _fumer_ are both at a Levenshtein distance of 2 from _fûmmer_ (wrong spelling), but Hunspell could find the correct spelling _fumer_ just by applying the substitutions described in the MAP and REP lines.
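
Here is a rough sketch of that two-stage idea: apply the MAP and REP rules first (up to two rule applications), and only fall back to plain Levenshtein distance if no dictionary word is reached. All names (rule_variants, suggest, max_steps, max_distance) are hypothetical, the dictionary is assumed to be a plain set of words, and this is not how Hunspell is actually implemented.

```python
def rule_variants(word, rep, map_groups):
    """All words reachable from word by applying one REP or MAP rule once."""
    out = set()
    for src, dst in rep:
        start = word.find(src)
        while start != -1:
            out.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    for group in map_groups:
        for i, ch in enumerate(word):
            if ch in group:
                for other in group:
                    if other != ch:
                        out.add(word[:i] + other + word[i + 1:])
    return out


def levenshtein(a, b):
    """Plain Levenshtein distance (every edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(word, dictionary, rep, map_groups, max_steps=2, max_distance=2):
    """Try MAP/REP rewrites first; fall back to Levenshtein distance."""
    frontier, seen = {word}, {word}
    for _ in range(max_steps):
        frontier = {v for w in frontier
                    for v in rule_variants(w, rep, map_groups)} - seen
        seen |= frontier
        hits = sorted(frontier & dictionary)
        if hits:
            return hits  # found by rules alone, no distance computed
    return sorted(w for w in dictionary
                  if levenshtein(word, w) <= max_distance)
```

With a dictionary containing _fumer_ and _gommer_, one REP pair ("mm", "m") and one MAP group "uûü", suggest("fûmmer", ...) returns only _fumer_. Without any rules, it falls back to the distance and returns both words.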

Regards,
Olivier

--

** Do not reply to this address. Mailing-list only. **
