Dear Simon,

>>
I think *that* would be throwing away the child. In languages like Dutch 
it is so easy to form a new but perfectly valid word by compounding, it 
is impossible to include all the possible combinations in a word list.
<<

Listing all of them is impossible, I agree. But it is not impossible to cover 
99.99%, and then you have an acceptable error rate and an acceptable hit rate.

>>It might be an idea to identify problem cases by running the list of 
known-good words through the suggestion mechanism, and making a list of 
all the variations that are accepted (only) using the mechanical 
compound mechanism. This list could then be reviewed and the words that 
are incorrectly spelled and/or nonsensical placed on a "reject list".
<<

Simon, this is up to you for Dutch. However, there are at least 4.9 milliard 
(billion) bad words (you can see why in my study), so I decided not to handle 
them. My life is not long enough for that, and it would bring no useful result 
anyway. It is a mistaken technology and way of thinking to assume that you can 
work with them. There are simply too many. I illustrate just a few of them in 
my table, that's all.

I did the selection of good words for Hungarian, and I can tell you, it was a 
LOT of work. 

If you do that, I strongly advise using a mechanical compounder for 
preselection. Words that turn out to be wrong after preselection are thrown 
away. After that I created word lists by word length: up to 8 chars, 8-10 
chars, 11-15 chars, and above 15 chars. I checked every word in every list 
with Yahoo/Google, so each length group was split into two groups: found by 
Google/Yahoo and not found by Google/Yahoo. These tricks helped me to speed up 
from 300 words/hour to 6000 words/hour.
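
The workflow above can be sketched roughly as follows. This is only an 
illustration under stated assumptions: is_compound and found_on_web are 
hypothetical stand-ins for a mechanical compounder and a search-engine lookup, 
preselection is read here as "drop what the compounder cannot analyse", and 
since "up to 8" and "8-10" overlap, 8-character words are assumed to belong to 
the first group.

```python
from collections import defaultdict

# Assumed bucket boundaries: 8-character words go into the first bucket.
BUCKETS = [(8, "up to 8"), (10, "9-10"), (15, "11-15")]


def length_bucket(word):
    """Return the label of the length group a word belongs to."""
    for upper, label in BUCKETS:
        if len(word) <= upper:
            return label
    return "above 15"


def sort_for_review(candidates, is_compound, found_on_web):
    """Group candidate words by (length bucket, found / not found).

    is_compound and found_on_web are caller-supplied callables
    (hypothetical stand-ins, not a real compounder or search API).
    """
    groups = defaultdict(list)
    for word in candidates:
        if not is_compound(word):
            continue  # preselection: drop what the compounder rejects
        status = "found" if found_on_web(word) else "not found"
        groups[(length_bucket(word), status)].append(word)
    return dict(groups)
```

Each resulting group can then be reviewed by hand on its own, which is where 
the speed-up would come from: short, web-attested words can be skimmed 
quickly, while long, unattested ones need a closer look.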

Machine compounding helped me a lot to filter the web corpus. It is a useful 
technology, but the error rate it creates is unacceptable for quality spell 
checking. Tricks do not help if the pig remains in the room: it will stink 
there, no matter what you try.

Regards, Eleonora
