On 2006-06-30 at 21:17 +0200 Daniel Naber sent off:
On Friday, 30 June 2006 11:42, Simon Brouwer wrote:

It might be an idea to identify problem cases by running the list of
known-good words through the suggestion mechanism, and making a list of
all the variations that are accepted (only) using the mechanical
compound mechanism. This list could then be reviewed and the words that
are incorrectly spelled and/or nonsensical placed on a "reject list".

What I did is this: I collected (and automatically generated) similar German words like Hand, Hund. I then replaced Hand by Hund and vice versa in a large list of compounds. Then I checked whether results like "Treuhund" are accepted. These cases have been reported to Björn Jacke, the maintainer of the German hunspell list.
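The swap test described above can be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function names are made up, only the first occurrence of a word is swapped, and `is_accepted` is a placeholder for whatever spell checker call is available (for example a wrapper around hunspell).

```python
def swap_variants(compounds, pairs):
    """Swap similar words (e.g. Hand <-> Hund) inside known compounds.

    Returns (original, variant) tuples for variants that are not
    themselves in the list of known compounds.
    """
    known = {w.lower() for w in compounds}
    out = []
    for word in compounds:
        low = word.lower()
        for a, b in pairs:
            for src, dst in ((a.lower(), b.lower()), (b.lower(), a.lower())):
                pos = low.find(src)
                if pos < 0:
                    continue
                # Replace the first occurrence only; German nouns are
                # capitalised, so restore the leading capital afterwards.
                variant = low[:pos] + dst + low[pos + len(src):]
                if variant not in known:
                    out.append((word, variant.capitalize()))
    return out


def suspicious(compounds, pairs, is_accepted):
    """Keep only the variants that the spell checker wrongly accepts.

    `is_accepted` is a hypothetical callable: word -> bool.
    """
    return [(o, v) for o, v in swap_variants(compounds, pairs) if is_accepted(v)]
```

Every pair returned by `suspicious` still needs a human look, since some swapped forms may be real words.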

Additionally, I check every compoundable word for commonness against a big list of words that also contains compounds. If a compoundable word occurs in only very few compounds, I add those few compounds to the dictionary instead of adding their first part as a compoundable word. Compoundable words should be added to the dictionary with great care, because silly or bogus words may end up being accepted: if "Zieh" is accepted as a compoundable word, "Ziehren" becomes correct. Strictly speaking there might be a "pulling reindeer", but usually this is a typo.

Cases like this, and cases like the ones Daniel mentions, have to be put into a blacklist whose entries are marked with hunspell's FORBIDDENWORD flag. Finding those cases can partly be done by a script that generates typos automatically. It also has to be done while building up the dictionary: grep huge word lists for substrings of each newly added word and check every match, to see whether the compound to be added is still correct and whether other, incorrect forms are created.

For example, Arbeits- is a common compoundable word. Before adding it, grep a huge word list for "Arbeit" (the word without any suffix) ... you will find Arbeitgeber. Adding Arbeits- as a compoundable word would make Arbeitsgeber a correct word, so you have to put Arbeitsgeber with the FORBIDDENWORD flag into your blacklist, including all affix flags, so that other variants of the bogus "Arbeitsgeber" are blacklisted too. There are many similar cases where grepping reveals that a new compoundable word produces more or less nasty typos.
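A rough sketch of the two checks, assuming the big word list is already loaded as a Python list. The function names are hypothetical and nothing here calls hunspell. The output is only a list of candidates for manual review, which would then be entered in the blacklist with the FORBIDDENWORD flag and the appropriate affix flags.

```python
def compounds_using(part, wordlist):
    """Words that start with `part` and continue with more letters.

    A short result suggests adding these few compounds directly
    instead of making `part` compoundable.
    """
    p = part.lower()
    return [w for w in wordlist if w.lower().startswith(p) and len(w) > len(p)]


def forbidden_candidates(base, linking, wordlist):
    """Propose blacklist candidates for a new compoundable form.

    For a base such as "Arbeit" and a linking element such as "s",
    take every known compound built on the bare base and build the
    linked form; if that form is not itself a known word, it is a
    candidate typo that the new compoundable word would accept.
    """
    known = {w.lower() for w in wordlist}
    b = base.lower()
    linked = (base + linking).lower()
    candidates = set()
    for w in wordlist:
        low = w.lower()
        # Skip unrelated words, the bare base, and words already
        # built on the linked form (e.g. Arbeitsamt).
        if not low.startswith(b) or low == b or low.startswith(linked):
            continue
        variant = linked + low[len(b):]
        if variant not in known:
            candidates.add(variant.capitalize())
    return sorted(candidates)
```

With a list containing Arbeitgeber and Arbeitsamt, `forbidden_candidates("Arbeit", "s", words)` proposes Arbeitsgeber and leaves Arbeitsamt alone. The prefix test is a crude heuristic, so each candidate still has to be checked by hand.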

Bjoern
