Hi,

2009/2/24 Olivier R. <[email protected]>:
> Hi,
>
> I would like to understand how hunspell tries to suggest the right spelling.

It uses a mix of suggestion algorithms, some of which are
dictionary-based.
The basic TRY algorithm looks for every suggestion at Levenshtein
distance 1 from the misspelled word.
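
As a rough illustration only (a minimal Python sketch, not Hunspell's
actual code), a distance-1 search can be written as below. The names
edits1, try_suggest, TRY_CHARS and WORDS are made up for this example,
and the word list is a toy one:

```python
# Hypothetical sketch of a distance-1 TRY search; not Hunspell's real code.
# TRY_CHARS is a lowercase subset of the TRY line of the French affix file.
TRY_CHARS = "aàâäbcçdeéèêëfghiîïjklmnoôöpqrstuùûüvwxyz"


def edits1(word, try_chars=TRY_CHARS):
    """Return every string at Levenshtein distance 1 from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Remove one character.
    deletions = {head + tail[1:] for head, tail in splits if tail}
    # Replace one character with a TRY character.
    substitutions = {
        head + c + tail[1:] for head, tail in splits if tail for c in try_chars
    }
    # Insert one TRY character at any position.
    insertions = {head + c + tail for head, tail in splits for c in try_chars}
    return (deletions | substitutions | insertions) - {word}


def try_suggest(word, dictionary, try_chars=TRY_CHARS):
    """Keep only the distance-1 candidates that are dictionary words."""
    return edits1(word, try_chars) & set(dictionary)


WORDS = {"déterrer", "déférer", "détirer", "terrer"}  # toy word list
print(sorted(try_suggest("détérer", WORDS)))
print(sorted(try_suggest("éterrer", WORDS)))
```

With this toy list, détérer yields déférer and détirer but not
déterrer, because déterrer needs two edits (é to e, plus an inserted r).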

>
> Here are some examples of the strange behaviour we get:
>
>
> ***** example 1 *****
> _déterrer_ is the correct spelling of a verb ("to dig up" in English)
>
> a. If I write: _détérer_
> Hunspell suggests: déférer, détirer
>
> b. If I write: _détèrer_ (the second accent is different)
> Hunspell suggests: détirer, délétère, détourer, _déterrer_ and a lot of
> other words.
> The fourth word is the correct one.
>
> But why is Hunspell able to suggest it if I write _détèrer_, but not able
> to do the same if I write _détérer_?

When the TRY algorithm finds a suggestion, no dictionary-based search
is performed.
This limitation exists mainly for speed, and future versions of
Hunspell will not have it.
In fact, the next Hunspell in OOo runs the dictionary-based search even
after successful TRY suggestions, when those TRY suggestions consist
only of deletions and insertions:

$ ~/hunspell-1.1.12/src/tools/hunspell -d fr_FR
Hunspell 1.1.12
dééterrer
& dééterrer 1 0: déterrer

éterrer
& éterrer 2 0: terrer, déterrer

$ ~/hunspell-1.2.8/src/tools/hunspell -d fr_FR
Hunspell 1.2.8
dééterrer
& dééterrer 5 0: déterrer, déterreur, déterrement, déterrage, déterrée

éterrer
& éterrer 6 0: terrer, déterrer, déterreur, éternuer, éterniser, éternelle

>
> e, é and è are defined as similar characters with the line
> MAP eéèêë
>
> If I write _détêrer_, _déterrer_ is suggested at the third position.
> If I write _détërer_, _déterrer_ is suggested at the second position.
>
>
> ***** example 2 *****
> _fumer_ is the correct spelling of a verb ("to smoke" in English)
>
> If I write: _fûmmer_
> Hunspell suggests: gemmer, nommer, gommer, sommer, pommer, fermer, frimer,
> former, filmer, fûtier, enflammer, emmerdé, drummer, commerce, emmerde
>
> There is not one word close to the right one.
>
> It should be easy for Hunspell to suggest _fumer_ with the lines:
> MAP uùûü
> REP mm m
>
> But Hunspell believes that _gemmer_ is closer to _fûmmer_ than _fumer_.
> Why?

Unfortunately, the dictionary-based suggestion algorithm does not use
the MAP and REP data yet, so for the n-gram search û is a completely
different character from u. Words with "mm" also get higher n-gram
values here than words with "me". As a result, the high n-gram value
of gemmer etc., and the equal word length and matching character
positions of fermer etc., win.
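
To make the effect concrete, here is a minimal Python sketch of an
n-gram overlap score. It is a simplification, not Hunspell's real
scoring (which has more terms); the names ngram_score and FOLD are made
up for this example, and the folding table is an assumption about how
the "MAP uùûü" line could be applied:

```python
# Hypothetical n-gram overlap score; Hunspell's real scoring has more terms.
def ngram_score(misspelled, candidate, max_n=3):
    """Count the 1..max_n-grams of misspelled that occur in candidate."""
    score = 0
    for n in range(1, max_n + 1):
        for i in range(len(misspelled) - n + 1):
            if misspelled[i:i + n] in candidate:
                score += 1
    return score


# Assumed folding table built from the "MAP uùûü" line only.
FOLD = str.maketrans("ùûü", "uuu")

for cand in ("gemmer", "fumer"):
    raw = ngram_score("fûmmer", cand)
    folded = ngram_score("fûmmer".translate(FOLD), cand)
    print(cand, raw, folded)
```

In this sketch gemmer scores higher than fumer for the raw word
fûmmer, and the order flips once û is folded to u.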

Using PHONE could help here, but the PHONE algorithm does not support
accented characters in the current Hunspell version. I hope this will
be fixed within a few months.

Best regards,
László

>
>
> ***** end of examples *****
>
>
> I just don't understand how Hunspell makes suggestions.
>
> I tried for example to remove the line KEY (see the Annex below).
> With _détérer_, Hunspell now suggests a lot of words instead of 2, and the
> right one (_déterrer_) is at the eighth position.
> But it does not change anything for the other misspellings or for
> _fûmmer_.
>
>
> Best regards,
> Olivier
>
> Annex: Rules about suggestions in the French affixes file:
>
> TRY
> aàâäbcçdeéèêëfghiîïjklmnoôöpqrstuùûüvwxyzæœAÀÂÄBCÇDEÉÈÊËFGHIÎÏJKLMNOÔÖPQRSTUÙÛÜVWXYZÆŒáíÿñåóşăã
>
> MAP aàâä
> MAP eéèêë
> MAP iîïy
> MAP oôö
> MAP uùûü
> MAP cç
> MAP AÀÂÄ
> MAP EÉÈÊË
> MAP IÎÏY
> MAP OÔÖ
> MAP UÙÛÜ
> MAP CÇ
>
> REP f ph
> REP ph f
> REP c qu
> REP qu c
> REP k qu
> REP qu k
> REP x ct
> REP ct x
> REP bb b
> REP b bb
> REP cc c
> REP c cc
> REP ff f
> REP f ff
> REP ll l
> REP l ll
> REP mm m
> REP m mm
> REP nn n
> REP n nn
> REP pp p
> REP p pp
> REP rr r
> REP r rr
> REP ss s
> REP s ss
> REP ss c
> REP c ss
> REP ss ç
> REP ç ss
> REP tt t
> REP t tt
> REP œ oe
> REP oe œ
> REP æ ae
> REP ae æ
> REP ai é
> REP é ai
> REP ai è
> REP è ai
> REP ai ê
> REP ê ai
> REP ei é
> REP é ei
> REP ei è
> REP è ei
> REP ei ê
> REP ê ei
> REP o au
> REP au o
> REP o eau
> REP eau o
>
> KEY
> azertyuiop|qsdfghjklmù|wxcvbn|aéz|yèu|iço|oàp|aqz|zse|edr|rft|tgy|yhu|uji|iko|olpm|qws|sxd|dcf|fvg|gbh|hnj
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
