Dear All,

The investigation below will be of interest to those whose languages make active use of compound words (Hungarian, German, Dutch, Swedish, ...).
I have now finished the investigation of Hungarian compound word creation. The results are at
http://tkltrans.sourceforge.net/tklspell/compound.htm#c01

The full lists of bad words are at:
http://tkltrans.sf.net/magyar/bbbad2.txt.gz
http://tkltrans.sf.net/magyar/betus94okx.txt.gz
http://tkltrans.sf.net/magyar/sav2koz.txt.gz

The conclusion:
Checking a corpus of 30 million words showed that approximately 10% of the automatically created compound words are wrong words of the above types. Automatic word compounding is a quick and dirty mechanism that cannot produce quality word lists, and therefore cannot deliver quality spell checking. Manually created word lists, if carefully built, tend to contain less than 0.5% wrong words.

The number of words
-------------------

Here are the words that are in reality bad:

[EMAIL PROTECTED] nagy_fajlok]$ wc el_bad/*
   7889    7889    95583 el_bad/bbbad2.txt
  38175   38175   467018 el_bad/betus94okx.txt
   8401    8401   142604 el_bad/sav2koz.txt    -- long words, over 15 chars long
  54465   54465   705205 total

Here are the words that the checker using the compounder thinks are good:

[EMAIL PROTECTED] nagy_fajlok]$ wc NL_jo/*
  64054   64054   802530 NL_jo/bbbad2.txt
 341204  341204  4309353 NL_jo/betus94okx.txt
 135044  135044  2302654 NL_jo/sav2koz.txt    -- long words, over 15 chars long
 540302  540302  7414537 total

The shorter the words, the more catastrophic the error rate.

I assume that the results are analogous in German, because the first investigations showed that quite clearly. If I have time, I will look into that somewhat more deeply as well.

Regards:
Eleonora
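
A note on the figures: the roughly 10% error rate can be checked by dividing the wc line counts quoted in this message. The following is only a minimal Python sketch, not part of the original investigation. It assumes that each el_bad file is the wrong-word subset of the same-named NL_jo file, which the message does not state explicitly; the function name error_rates is made up for illustration.

```python
# Line counts copied from the two `wc` listings in this message.
# Assumption: each el_bad file holds the wrong words found among the
# words in the same-named NL_jo file accepted by the compounding checker.
bad = {"bbbad2.txt": 7889, "betus94okx.txt": 38175, "sav2koz.txt": 8401}
accepted = {"bbbad2.txt": 64054, "betus94okx.txt": 341204, "sav2koz.txt": 135044}


def error_rates(bad, accepted):
    """Return the percentage of wrong words per file, plus the overall total."""
    rates = {name: 100.0 * bad[name] / accepted[name] for name in accepted}
    rates["total"] = 100.0 * sum(bad.values()) / sum(accepted.values())
    return rates


rates = error_rates(bad, accepted)
for name, pct in rates.items():
    print(f"{name:16s} {pct:5.1f}%")
```

Under that reading the overall rate comes out at about 10.1%, and the long-word list (sav2koz.txt, words over 15 characters) has the lowest rate at about 6.2%, which agrees with the remark that shorter words fare worse.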
