Dear All,

The investigation below will be of interest to those whose language
makes active use of compound words (Hungarian, German, Dutch, Swedish, ...).

I have now completed my investigation of Hungarian compound-word creation.

The results are at
http://tkltrans.sourceforge.net/tklspell/compound.htm#c01

The full lists of bad words are at
http://tkltrans.sf.net/magyar/bbbad2.txt.gz
http://tkltrans.sf.net/magyar/betus94okx.txt.gz
http://tkltrans.sf.net/magyar/sav2koz.txt.gz

The conclusion:

Checking a 30-million-word corpus showed that automatically created
compound words contain approximately 10% wrong words of the above types.
Automatic word compounding is a quick and dirty mechanism that is not
capable of producing quality word lists, and therefore not capable of
quality spell checking. Manually created word lists, if carefully created,
tend to contain less than 0.5% wrong words.

The number of words
-------------------
These are the words that are actually bad:

[EMAIL PROTECTED] nagy_fajlok]$ wc el_bad/*
  7889   7889  95583 el_bad/bbbad2.txt
 38175  38175 467018 el_bad/betus94okx.txt
  8401   8401 142604 el_bad/sav2koz.txt  -- long words, over 15 characters
 54465  54465 705205 total

These are the words that the checker using the compounder accepts as good:

[EMAIL PROTECTED] nagy_fajlok]$ wc NL_jo/*
  64054   64054  802530 NL_jo/bbbad2.txt
 341204  341204 4309353 NL_jo/betus94okx.txt
 135044  135044 2302654 NL_jo/sav2koz.txt   -- long words, over 15 characters
 540302  540302 7414537 total
 
The shorter the words, the more catastrophic the error rate.
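
The error rates behind this can be recomputed from the two wc listings.
The following is only a minimal sketch in Python. It assumes that each
el_bad list is the wrongly accepted subset of the NL_jo list of the same
name, so that bad / accepted is the error rate; the variable and function
names are mine and are not part of tklspell.

```python
# Line counts copied from the two `wc` listings (one word per line).
# Assumption: each el_bad file is a subset of the NL_jo file of the same name.
bad = {"bbbad2.txt": 7889, "betus94okx.txt": 38175, "sav2koz.txt": 8401}
accepted = {"bbbad2.txt": 64054, "betus94okx.txt": 341204, "sav2koz.txt": 135044}


def error_rate(bad_count, accepted_count):
    """Return the share of accepted words that are actually bad, in percent."""
    return 100.0 * bad_count / accepted_count


# Per-list rates, then the rate over all three lists together.
rates = {name: error_rate(bad[name], accepted[name]) for name in bad}
total_rate = error_rate(sum(bad.values()), sum(accepted.values()))

for name, rate in rates.items():
    print(f"{name:16s} {rate:5.1f}%")
print(f"{'total':16s} {total_rate:5.1f}%")
```

Under that assumption this gives about 12.3% for bbbad2.txt, 11.2% for
betus94okx.txt and 6.2% for sav2koz.txt (the long words), or about 10.1%
overall.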

I assume that the results for German are analogous, because my first
investigations showed that quite clearly. If I have time, I will look
into that somewhat deeper as well.

Regards: Eleonora

