Hi Eleonora,
ge wrote:
Dear All,
The investigation below will be of interest to those whose language
makes active use of compound words (Hungarian, German, Dutch, Swedish, ...).
I have now completed the investigation of Hungarian compound word creation.
The results are in
http://tkltrans.sourceforge.net/tklspell/compound.htm#c01
The full lists of bad words are at
http://tkltrans.sf.net/magyar/bbbad2.txt.gz,
http://tkltrans.sf.net/magyar/betus94okx.txt.gz
http://tkltrans.sf.net/magyar/sav2koz.txt.gz
The conclusion:
Checking a corpus of 30 million words showed that, among the words accepted as
automatically created compounds, approximately 10% are wrong words of
the above types. Automatic word compounding is a quick and dirty mechanism that
is not capable of producing quality word lists, and therefore quality spell
checking. Manually created word lists, if carefully created, tend to contain
less than 0.5% wrong words.
The number of words.
-------------------
Here are the words that are in reality bad:
[EMAIL PROTECTED] nagy_fajlok]$ wc el_bad/*
7889 7889 95583 el_bad/bbbad2.txt
38175 38175 467018 el_bad/betus94okx.txt
8401 8401 142604 el_bad/sav2koz.txt -- long words, over 15 chars long
54465 54465 705205 total
Here are the words that the checker using the compounder considers good:
[EMAIL PROTECTED] nagy_fajlok]$ wc NL_jo/*
64054 64054 802530 NL_jo/bbbad2.txt
341204 341204 4309353 NL_jo/betus94okx.txt
135044 135044 2302654 NL_jo/sav2koz.txt -- long words, over 15 chars long
540302 540302 7414537 total
The shorter the words, the more catastrophic the error rate.
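As a quick sanity check of the numbers above, the per-file ratios can be worked out with a few lines of Python. This is only a sketch, and it rests on my assumption that the NL_jo counts are all the words the compounding checker accepts and the el_bad counts are the wrong ones among them:

```python
# Counts copied from the wc output above: (bad words, words accepted by the checker).
# Reading el_bad as a subset of NL_jo is an assumption, not stated in the original.
counts = {
    "bbbad2.txt": (7889, 64054),
    "betus94okx.txt": (38175, 341204),
    "sav2koz.txt": (8401, 135044),  # long words, over 15 chars long
}


def error_rate(bad, accepted):
    """Share of accepted words that are actually wrong, in percent."""
    return 100.0 * bad / accepted


rates = {name: error_rate(bad, accepted) for name, (bad, accepted) in counts.items()}
total_bad = sum(bad for bad, _ in counts.values())
total_accepted = sum(accepted for _, accepted in counts.values())
total_rate = error_rate(total_bad, total_accepted)

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
print(f"total: {total_rate:.1f}%")
```

Under that reading the long-word file comes out at about 6%, against 11 to 12% for the other two files and about 10% overall.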
It might then be a good idea for the spell checker to reject guessed
compounds below a certain minimum length (configurable in the affix file).
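To make the idea concrete, here is a rough Python sketch. Everything in it is hypothetical: the threshold name MIN_GUESSED_COMPOUND_LEN, the naive splitting routine, and the toy English lexicon are mine and do not describe how any existing checker works.

```python
# Hypothetical threshold; in a real checker it would be read from the affix file.
MIN_GUESSED_COMPOUND_LEN = 10


def is_guessed_compound(word, lexicon, min_part_len=3):
    """True if `word` is a concatenation of two or more lexicon words,
    each at least `min_part_len` characters long (naive recursive split)."""
    for i in range(min_part_len, len(word) - min_part_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon and (
            tail in lexicon or is_guessed_compound(tail, lexicon, min_part_len)
        ):
            return True
    return False


def accept(word, lexicon, min_len=MIN_GUESSED_COMPOUND_LEN):
    """Accept listed words as usual; accept guessed compounds only if long enough."""
    if word in lexicon:
        return True
    if len(word) < min_len:
        return False  # too short to trust a guessed compound
    return is_guessed_compound(word, lexicon)


# Toy lexicon, for illustration only.
lexicon = {"spell", "check", "word", "list"}
print(accept("wordlist", lexicon))        # 8 characters, below the threshold
print(accept("spellchecklist", lexicon))  # 14 characters, splits into three words
```

Words in the list itself are never affected; only the guessed compounds are subject to the length limit, so short compounds that matter would have to be added to the word list explicitly.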
I assume that the results for German are analogous, because the first
investigations showed that quite clearly. If I have time, I
will also look into that somewhat deeper.
I notice that the German examples show mostly wrong compounds that are
misspellings of other words. Maybe that list is not representative, but
such errors would be more common and are more difficult for the
user to spot.
So a possible improvement could be to disqualify a guessed compound if
it is too similar to a word that is actually in the word list. The
existing suggestion mechanism could be used to determine this.
Or maybe such mechanisms have already been implemented?
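A minimal sketch of that check, in Python. It substitutes a plain Levenshtein distance and a linear scan of the word list for the real suggestion mechanism, and the function names and toy lexicons are made up for illustration:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def near_misses(word, lexicon, max_distance=1):
    """Listed words within `max_distance` edits of `word` (excluding `word` itself)."""
    return sorted(
        w for w in lexicon
        if w != word
        and abs(len(w) - len(word)) <= max_distance
        and edit_distance(word, w) <= max_distance
    )


def accept_guessed_compound(word, lexicon, max_distance=1):
    """Assumes the compounder has already split `word` into listed parts.
    Reject it anyway if it looks like a misspelling of a listed word."""
    return not near_misses(word, lexicon, max_distance)


# "forrest" parses as "for" + "rest", but is one edit away from "forest".
print(near_misses("forrest", {"for", "rest", "forest"}))
print(accept_guessed_compound("forrest", {"for", "rest", "forest"}))
print(accept_guessed_compound("bookshelf", {"book", "shelf"}))
```

Scanning the whole word list for every guessed compound would be far too slow in practice, which is why reusing the suggestion machinery looks attractive to me.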
--
Vriendelijke groet,
Simon Brouwer.
| nl.openoffice.org | www.opentaal.org |