Hi Bjoern,
Bjoern JACKE wrote:
On 2006-06-30 at 21:17 +0200 Daniel Naber sent off:
On Friday 30 June 2006 11:42, Simon Brouwer wrote:
It might be an idea to identify problem cases by running the list of
known-good words through the suggestion mechanism, and making a list of
all the variations that are accepted (only) using the mechanical
compound mechanism. This list could then be reviewed and the words that
are incorrectly spelled and/or nonsensical placed on a "reject list".
What I did is this: I collected (and automatically generated) similar
German words like Hand, Hund. I then replaced Hand by Hund and vice
versa in a large list of compounds. Then I checked whether results
like "Treuhund" are accepted. These cases have been reported to Björn
Jacke, the maintainer of the German hunspell list.
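A minimal sketch of the swap test described above, in Python. It is
not the script that was actually used; the function names and the
is_accepted callback are assumptions made for illustration. The
callback stands in for whatever spell checker is being tested, and the
sketch assumes capitalised nouns as in German.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def swapped_variants(compounds: Iterable[str],
                     pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (original, variant) where one word of a similar pair is
    replaced by the other inside a known compound."""
    compounds = list(compounds)
    pairs = list(pairs)
    known = {c.lower() for c in compounds}
    for compound in compounds:
        lower = compound.lower()
        for a, b in pairs:
            # Try the replacement in both directions (Hand -> Hund, Hund -> Hand).
            for src, dst in ((a.lower(), b.lower()), (b.lower(), a.lower())):
                if src in lower:
                    variant = lower.replace(src, dst)
                    # Skip variants that are themselves known-good compounds.
                    if variant not in known:
                        yield compound, variant.capitalize()


def accepted_swaps(compounds: Iterable[str],
                   pairs: Iterable[Tuple[str, str]],
                   is_accepted: Callable[[str], bool]) -> List[Tuple[str, str]]:
    """Return the swapped variants that the checker accepts.

    is_accepted is a hypothetical stand-in for the spell checker under test.
    """
    return [(orig, var)
            for orig, var in swapped_variants(compounds, pairs)
            if is_accepted(var)]
```

Every pair returned by accepted_swaps is only a candidate; each one
still has to be reviewed by hand before it is reported.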
Additionally, I check every compoundable word for commonness against a
big list of words which also contains compounds. If a compoundable
word occurs in only very few compound words, I take those few compound
words into the dictionary instead of adding the first part of the
compound as a compoundable word. Adding compoundable words to the
dictionary should be done very carefully. Silly or bogus words may
otherwise be accepted: if "Zieh" is accepted as a compoundable word,
"Ziehren" becomes correct. Strictly speaking there might be a
"pulling reindeer", but usually this is a typo. Cases like this, and
cases like the ones Daniel mentions, have to be put into a blacklist
that is flagged with hunspell's FORBIDDENWORD flag. Finding those
cases can partly be done by a script that generates typos
automatically, but it also has to be done while building up the
dictionary: grep huge word lists for substrings of the newly added
words and check each match to see whether the compound word to be
added is still correct, or whether other, incorrect forms are
created:
Arbeits- is a common compoundable word. Before adding it, grep a huge
word list for "Arbeit" (the word without any suffix) ... you will find
Arbeitgeber. Adding Arbeits- as a compoundable word would make
Arbeitsgeber a correct word, so you have to put Arbeitsgeber with the
FORBIDDENWORD flag into your blacklist, including all affix flags so
that other variants of the bogus "Arbeitsgeber" are blacklisted, too.
There are many cases similar to this, where you find out by grepping
that new compoundable words produce more or less nasty typos.
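The grep step from the Arbeits- example could be sketched roughly as
follows, in Python. This is an illustration under assumptions, not the
procedure that was actually used: the function name and its parameters
are made up here, and the word list is assumed to fit in memory.

```python
from typing import Iterable, List


def forbidden_candidates(stem: str,
                         linking_form: str,
                         wordlist: Iterable[str]) -> List[str]:
    """Suggest words that a new compoundable form would wrongly accept.

    stem         -- the bare word, e.g. "Arbeit"
    linking_form -- the compoundable form to be added, e.g. "Arbeits"
    wordlist     -- a large list of known-good words
    """
    words = set(wordlist)
    candidates = []
    for word in sorted(words):
        # Look for known compounds built on the bare stem (Arbeitgeber),
        # ignoring the stem itself and words already using the linking form.
        if (word.startswith(stem)
                and not word.startswith(linking_form)
                and len(word) > len(stem)):
            # The same compound with the linking form (Arbeitsgeber) would
            # become acceptable once the linking form is compoundable.
            variant = linking_form + word[len(stem):]
            if variant not in words:
                candidates.append(variant)
    return candidates
```

The result is a list of candidates for manual review only. How an
entry is then written into the blacklist, with the FORBIDDENWORD flag
and the affix flags, depends on the dictionary in question.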
Bjoern
Thanks for this useful explanation! I will take your recommendations to
heart when implementing compounding in the Dutch spell checker files.
Did you do the checking manually, or did you use some software for this?
--
Kind regards,
Simon Brouwer.
| nl.openoffice.org | www.opentaal.org |