Dear Simon,

>> I think *that* would be throwing away the child. In languages like Dutch it is so easy to form a new but perfectly valid word by compounding that it is impossible to include all the possible combinations in a word list. <<

All is impossible, I agree. But it is not impossible to find 99.99% of them, and then you have an acceptable error rate and an acceptable hit rate.

>> It might be an idea to identify problem cases by running the list of known-good words through the suggestion mechanism, and making a list of all the variations that are accepted (only) using the mechanical compound mechanism. This list could then be reviewed and the words that are incorrectly spelled and/or nonsensical placed on a "reject list". <<

Simon, this is up to you for Dutch. However, there are at least 4.9 billion bad words (you can see why in my study), so I decided not to handle bad words. My life is not long enough to handle them, and it would bring no useful result anyway. Assuming that you can work with them is a mistaken technology and a mistaken way of thinking. There are simply too many of them. I only illustrate a few of them in my table, that's all.

I did the selection of good words for Hungarian, and I can tell you, it was a LOT of work. If you do that, I strongly advise using a mechanical compounder for preselection. Words rejected in preselection are thrown away. After that I created word lists by word length: up to 8 chars, 8-10 chars, 11-15 chars, and above 15 chars. I checked every word in every list with Yahoo/Google, so each length group then had two subgroups: found by Google/Yahoo and not found by Google/Yahoo. These tricks helped me to speed up from 300 words/hour to 6000 words/hour.

Machine compounding helped me a lot to filter the web corpus. It is a useful technology, but the error rate it creates is unacceptable for quality spell checking. Tricks do not help if the pig remains in the room: it will stink there, no matter what you try.

Regards,
Eleonora
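
The preselection workflow described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual tooling used for the Hungarian list: `is_mechanical_compound` and `found_on_web` are hypothetical callables standing in for the mechanical compounder and the Yahoo/Google lookup, and placing 8-character words in the "up to 8" group is an assumption, since the message lists both "up to 8" and "8-10".

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def length_bucket(word: str) -> str:
    """Return the length group for a word.

    Assumption: words of exactly 8 characters go into "up to 8".
    """
    n = len(word)
    if n <= 8:
        return "up to 8"
    if n <= 10:
        return "8-10"
    if n <= 15:
        return "11-15"
    return "above 15"


def preselect(
    candidates: Iterable[str],
    is_mechanical_compound: Callable[[str], bool],  # hypothetical compounder check
    found_on_web: Callable[[str], bool],            # hypothetical search-engine check
) -> Dict[Tuple[str, str], List[str]]:
    """Filter candidates, then group them by (length group, web status)."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for word in candidates:
        # Words rejected by the mechanical compounder are thrown away.
        if not is_mechanical_compound(word):
            continue
        status = "found" if found_on_web(word) else "not found"
        groups[(length_bucket(word), status)].append(word)
    return dict(groups)
```

Each resulting group can then be reviewed by hand on its own. The two-way split by web status is what lets the "found" groups be skimmed much faster than the "not found" ones.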
