Hi Eleonora,

ge wrote:
Dear All,

The investigation below will be of interest to those whose language makes active use of compound words (Hungarian, German, Dutch, Swedish, ...).

I have now completed the investigation of Hungarian compound word creation.

The results are in http://tkltrans.sourceforge.net/tklspell/compound.htm#c01

The full lists of bad words are at
http://tkltrans.sf.net/magyar/bbbad2.txt.gz,
http://tkltrans.sf.net/magyar/betus94okx.txt.gz,
http://tkltrans.sf.net/magyar/sav2koz.txt.gz

The conclusion:

Checking a corpus of 30 million words showed that approximately 10% of the
automatically created compound words are wrong words of the above types.
Automatic word compounding is a quick and dirty mechanism that is not
capable of producing quality word lists, and therefore not of quality spell
checking. Manually created word lists, if carefully compiled, tend to contain
less than 0.5% wrong words.

The number of words.
-------------------
Here are the words that are in reality bad:

[EMAIL PROTECTED] nagy_fajlok]$ wc el_bad/*
  7889   7889  95583 el_bad/bbbad2.txt
 38175  38175 467018 el_bad/betus94okx.txt
  8401   8401 142604 el_bad/sav2koz.txt  -- long words, over 15 chars
 54465  54465 705205 total

Here are the words that the checker using the compounder considers good:

[EMAIL PROTECTED] nagy_fajlok]$ wc NL_jo/*
  64054   64054  802530 NL_jo/bbbad2.txt
 341204  341204 4309353 NL_jo/betus94okx.txt
 135044  135044 2302654 NL_jo/sav2koz.txt   -- long words, over 15 chars
 540302  540302 7414537 total
The shorter the words, the more catastrophic the error rate.
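
To make the arithmetic behind these figures explicit, here is a small sketch that recomputes the error rate per file from the two wc listings. It assumes (my reading, not stated explicitly above) that the bad words are a subset of the words the compounding checker accepted, so the rate is bad / accepted. The dictionary and function names are only illustrative.

```python
# Line counts copied from the two wc listings above.
bad = {"bbbad2.txt": 7889, "betus94okx.txt": 38175, "sav2koz.txt": 8401}
accepted = {"bbbad2.txt": 64054, "betus94okx.txt": 341204, "sav2koz.txt": 135044}


def error_rate(bad_count, accepted_count):
    """Share of accepted words that are in reality bad (assumed subset)."""
    return bad_count / accepted_count


# Per-file rates, plus the overall rate computed from the summed counts.
rates = {name: error_rate(bad[name], accepted[name]) for name in bad}
overall = error_rate(sum(bad.values()), sum(accepted.values()))

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
print(f"total: {overall:.1%}")
```

Under that assumption the total comes out at roughly 10%, and the file with the long words (sav2koz.txt, over 15 characters) has a clearly lower rate than the other two, which fits the remark about shorter words.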

It might then be a good idea for the spell checker to reject guessed compounds below a certain minimum length (configurable in the affix file).
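
A minimal sketch of what I mean, using a toy two-part splitter rather than the real compounding code. The names split_compound, is_accepted and min_compound_len are hypothetical and do not refer to any existing option or function; the word list is an English toy example.

```python
def split_compound(word, lexicon):
    """Return (left, right) if word is two lexicon words joined, else None."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None


def is_accepted(word, lexicon, min_compound_len):
    """Accept listed words; accept guessed compounds only if long enough."""
    if word in lexicon:
        return True
    if len(word) < min_compound_len:
        # Hypothetical rule: too short to be trusted as a guessed compound.
        return False
    return split_compound(word, lexicon) is not None


# Toy word list, for illustration only.
lexicon = {"to", "in", "book", "shelf"}

print(is_accepted("bookshelf", lexicon, min_compound_len=8))  # long guessed compound
print(is_accepted("toin", lexicon, min_compound_len=8))       # short guessed compound
print(is_accepted("book", lexicon, min_compound_len=8))       # listed word
```

Words that are actually in the list are not affected by the threshold; only guessed compounds are.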

I assume that the results for German are analogous, because the first
investigations showed that quite clearly. If I have time, I will also look into that somewhat more deeply.

I notice that the German examples show mostly wrong compounds that are misspellings of other words. Maybe that list is not representative, but such errors would be more common and are more difficult to spot by the user. So a possible improvement could be to disqualify a guessed compound if it is too similar to a word that is actually in the word list. The existing suggestion mechanism could be used to determine this.
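
As a rough illustration of the idea, the sketch below uses a plain Levenshtein edit distance as a stand-in for the suggestion mechanism (which it does not call). The function names, the max_distance threshold and the toy English word list are assumptions of mine.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def too_similar(word, lexicon, max_distance=1):
    """True if a different listed word lies within max_distance edits."""
    return any(0 < edit_distance(word, known) <= max_distance
               for known in lexicon)


# Toy word list, for illustration only.
lexicon = {"car", "pat", "carpet", "book", "shelf"}

# "carpat" could be guessed as car+pat, but it is one edit from "carpet".
print(too_similar("carpat", lexicon))
# "bookshelf" (book+shelf) is not close to any listed word.
print(too_similar("bookshelf", lexicon))
```

A guessed compound for which too_similar returns True would be disqualified, on the assumption that it is more likely a misspelling of the listed word.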

Or maybe such mechanisms have already been implemented?

--
Vriendelijke groet,
Simon Brouwer.

| nl.openoffice.org | www.opentaal.org |

