Re: [lingu-dev] compound words

Simon Brouwer Fri, 30 Jun 2006 00:52:16 -0700

Hi Eleonora,

ge wrote:

Dear All,

The investigation below will be of interest to those whose language actively uses compound words (Hungarian, German, Dutch, Swedish, ...)


I have now finished the investigation of Hungarian compound word creation.

The results are at http://tkltrans.sourceforge.net/tklspell/compound.htm#c01


The full lists of bad words are at
http://tkltrans.sf.net/magyar/bbbad2.txt.gz
http://tkltrans.sf.net/magyar/betus94okx.txt.gz
http://tkltrans.sf.net/magyar/sav2koz.txt.gz

The conclusion:

Checking a corpus of 30 million words showed that the automatically created
compound words contain approximately 10% wrong words of the above types.
Automatic word compounding is a quick and dirty mechanism that cannot produce
quality word lists, and therefore cannot deliver quality spell checking.
Manually created word lists, if carefully compiled, tend to contain less than
0.5% wrong words.

The number of words.
-------------------
Here are the words that are actually bad:

[EMAIL PROTECTED] nagy_fajlok]$ wc el_bad/*
  7889   7889  95583 el_bad/bbbad2.txt
 38175  38175 467018 el_bad/betus94okx.txt
  8401   8401 142604 el_bad/sav2koz.txt  -- long words, over 15 chars long
 54465  54465 705205 total

Here are the words that the checker, using the compounder, considers good:

[EMAIL PROTECTED] nagy_fajlok]$ wc NL_jo/*
  64054   64054  802530 NL_jo/bbbad2.txt
 341204  341204 4309353 NL_jo/betus94okx.txt
 135044  135044 2302654 NL_jo/sav2koz.txt   -- long words, over 15 chars long
 540302  540302 7414537 total

The shorter the words, the more catastrophic the error rate.
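As a rough check on these figures, the per-list error rates can be recomputed from the two `wc` listings. This is only a sketch: it assumes that each list of bad words is a subset of the matching list of accepted words, which the post does not state explicitly but which agrees with the 10% figure in the conclusion.

```python
# Word counts copied from the first column of the two `wc` listings.
BAD = {"bbbad2.txt": 7889, "betus94okx.txt": 38175, "sav2koz.txt": 8401}
ACCEPTED = {"bbbad2.txt": 64054, "betus94okx.txt": 341204, "sav2koz.txt": 135044}

def error_rates(bad, accepted):
    """Return the percentage of accepted words that are wrong, per list and in total.

    Assumption: every bad word is also counted among the accepted words.
    """
    rates = {name: 100.0 * bad[name] / accepted[name] for name in accepted}
    rates["total"] = 100.0 * sum(bad.values()) / sum(accepted.values())
    return rates

RATES = error_rates(BAD, ACCEPTED)
for name, pct in RATES.items():
    print(f"{name:16s} {pct:5.2f}%")
```

Under that assumption the total comes out at roughly 10%, and the list of long words (sav2koz.txt, over 15 characters) at roughly 6%, against 11 to 12% for the other two lists.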

It might then be a good idea if the spell checker would reject guessed compounds below a certain minimum length (configurable in the affix file).
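A minimal sketch of such a length threshold, to make the idea concrete. Everything here is hypothetical: the toy lexicon, the naive splitter, and the name `MIN_GUESSED_LEN` are illustrations only and do not correspond to any existing spell checker option.

```python
# Toy lexicon; a real checker would load its word list from the dictionary file.
LEXICON = {"book", "shelf", "car", "pet"}

# Hypothetical setting, imagined here as a value read from the affix file.
MIN_GUESSED_LEN = 8

def is_guessed_compound(word, lexicon, min_part=3):
    """True if `word` can be split into two or more lexicon entries."""
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon and (
            tail in lexicon or is_guessed_compound(tail, lexicon, min_part)
        ):
            return True
    return False

def accept(word, lexicon=LEXICON, min_guessed_len=MIN_GUESSED_LEN):
    """Listed words always pass; guessed compounds pass only if long enough."""
    if word in lexicon:
        return True
    if len(word) < min_guessed_len:
        return False
    return is_guessed_compound(word, lexicon)

print(accept("car"))        # listed word
print(accept("bookshelf"))  # guessed compound, 9 letters
print(accept("petcar"))     # guessed compound, only 6 letters
```

The threshold only applies to words that are not in the list themselves, so short words that were entered manually are unaffected.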

I assume that the results for German are analogous, because the first
investigations showed that quite clearly. If I have time, I will also look into that somewhat deeper.

I notice that the German examples show mostly wrong compounds that are misspellings of other words. Maybe that list is not representative, but such errors would be more common and are more difficult to spot by the user.

So a possible improvement could be to disqualify a guessed compound if it is too similar to a word that is actually in the word list. The existing suggestion mechanism could be used to determine this.


Or maybe such mechanisms have already been implemented?
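To illustrate the similarity check, here is a small sketch that rejects a guessed compound when it lies within one edit of a listed word. It is an assumption-laden stand-in: it uses a hand-written one-edit test and a linear scan over a toy lexicon, whereas the proposal is to reuse the checker's existing suggestion mechanism. The names `within_one_edit` and `is_near_miss` are made up for this example.

```python
# Toy lexicon; "forgoten" can be guessed as for+go+ten but is really a typo.
LEXICON = {"for", "go", "ten", "forgotten", "book", "shelf"}

def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion,
    substitution, or swap of two adjacent letters."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    # Length of the common prefix.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        if a[i + 1:] == b[i + 1:]:
            return True  # one substitution
        return (
            i + 1 < len(a)
            and a[i] == b[i + 1]
            and a[i + 1] == b[i]
            and a[i + 2:] == b[i + 2:]
        )  # one adjacent swap
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter word
    return a[i:] == b[i + 1:]  # one extra letter in `b`

def is_near_miss(word, lexicon=LEXICON):
    """True if `word` is not listed but is one edit away from a listed word."""
    return word not in lexicon and any(
        within_one_edit(word, known) for known in lexicon
    )

print(is_near_miss("forgoten"))   # one letter short of "forgotten"
print(is_near_miss("bookshelf"))  # no listed word is this close
```

A guessed compound for which `is_near_miss` returns true would then be flagged as misspelled, with the listed word offered as the suggestion.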

--
Vriendelijke groet,
Simon Brouwer.

| nl.openoffice.org | www.opentaal.org |

