[lingu-dev] Re: compound recognition and typos

Németh László Tue, 17 Feb 2009 01:26:46 -0800

Hi Ruud,

You are absolutely right. Compound recognition lets a lot of typos
through, but Hunspell already has the feature you suggest, to forbid
the ugliest spelling mistakes that compound analysis would otherwise
accept: if the (pseudo) compound word can be produced from a
dictionary word (or from one of its affixed forms) by one of the REP
replacement rules, Hunspell won't accept it. For example, one of the
most typical Hungarian spelling mistakes is the i↔í replacement.
Using the


REP i í
REP í i

rules, the bad "szer+víz" or "elit+élt" compounds aren't accepted,
because the dictionary contains the words "szerviz" and "elítélt". You
may also have to extend the REP rules with similar one-character
replacements to catch the most important spelling mistakes of your
language.
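
To make the check concrete, here is a rough Python sketch of the idea,
not Hunspell's actual code. It assumes a toy word list, two-part
compounds only, and no affix handling; the names DICTIONARY, REP_RULES,
single_replacements, split_compound and accept are made up for this
illustration, and "vízszer" is an invented compound used only to show
the accepting case.

```python
# Toy word list; a real dictionary would also cover affixed forms.
DICTIONARY = {"szer", "víz", "szerviz", "elit", "élt", "elítélt"}

# Mirrors the two REP lines above: (pattern, replacement).
REP_RULES = [("i", "í"), ("í", "i")]


def single_replacements(word, rules):
    """Yield every string made by applying one rule at one position."""
    for pattern, replacement in rules:
        start = word.find(pattern)
        while start != -1:
            yield word[:start] + replacement + word[start + len(pattern):]
            start = word.find(pattern, start + 1)


def split_compound(word, dictionary):
    """Return the first split into two dictionary words, or None."""
    for i in range(1, len(word)):
        if word[:i] in dictionary and word[i:] in dictionary:
            return word[:i], word[i:]
    return None


def accept(word, dictionary=DICTIONARY, rules=REP_RULES):
    """Accept dictionary words, and compounds that are not one REP
    replacement away from a dictionary word."""
    if word in dictionary:
        return True
    if split_compound(word, dictionary) is None:
        return False
    # Reject the compound if a single replacement yields a known word.
    return not any(c in dictionary for c in single_replacements(word, rules))


if __name__ == "__main__":
    for candidate in ["szerviz", "szervíz", "elítélt", "elitélt", "vízszer"]:
        print(candidate, accept(candidate))
```

With this word list, "szervíz" and "elitélt" are rejected because one
replacement turns them into "szerviz" and "elítélt", while the
dictionary words themselves are still accepted.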

I think that for average word processing in a language with an
arbitrary number of compound words, it is much better to use the
compound recognition feature of Hunspell. But for other tasks,
especially checking and editing artificially distorted texts, like the
output of an OCR program, you may need to add new REP rules (for the
typical OCR errors) or to offer an optional dictionary without
compound recognition.

Regards,
László


2009/2/17 R.J. Baars <[email protected]>:
> Laszlo,
>
> One of my colleagues in OpenTaal (also project leader of OOo NL) is worried
> about the compounding supporting compounds that could easily be a mistake.
>
> Of course we can try and find these, and flag them as forbiddenword, but
> did you ever think of a function, detecting whether the compounded word is
> a possible typo for a word that is in the list itself, and if so, forbid
> it?
>
> Ruud
