Hi,
then the next question:
if I have to mark my SFX entries as "N" (no combining with prefixes), what is the reason to have PFX entries at all?
In Latvian, a prefix changes the meaning of the word (sometimes completely), without
changing the declension pattern or the number of word forms it produces.
At a basic level, a spelling checker simply takes an unknown word and looks it up in a list of commonly used words (all correctly spelled). Unfortunately, for many languages the list of commonly used words is simply too large to be searched or accessed easily with a reasonable memory footprint and access speed. Luckily, many of those same languages use prefixes and/or suffixes (sometimes in combination) on a much smaller list of root words to create many of their commonly used words.
So all an .aff file is used for is to identify some of the most commonly used prefixes and suffixes so that a much smaller set of root words with affix flags can be used to effectively store a much longer list of commonly used words.
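To make the compression idea concrete, here is a toy sketch in Python (not real .aff syntax; the rules and words are invented for the example) of how a few suffix rules plus one flagged root entry can stand in for several plain dictionary lines:

```python
# Toy model of SFX entries: flag -> list of (strip, add) pairs.
# One flagged root ("walk" + flags "SDG") generates four words,
# so the word list only needs to store one entry.
SUFFIX_RULES = {
    "S": [("", "s")],    # plural: walk -> walks
    "D": [("", "ed")],   # past:   walk -> walked
    "G": [("", "ing")],  # gerund: walk -> walking
}

def expand(root, flags):
    """Yield the root plus every form its flags license."""
    yield root
    for flag in flags:
        for strip, add in SUFFIX_RULES.get(flag, []):
            base = root[: len(root) - len(strip)] if strip else root
            yield base + add

forms = sorted(expand("walk", "SDG"))
```

Here `forms` comes out as the four words `walk`, `walked`, `walking`, `walks`, all recoverable from the single entry `walk/SDG`.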
That is the whole concept behind ispell which myspell has tried to adopt.
It does not matter what adding a prefix or a suffix actually does to the meaning of the root word (that would be the domain of a grammar checker), as long as a correctly spelled new word is made from a correctly spelled root word and its defined affixes.
At the moment I am doing it like this: munch with the full affix set; if the result looks
awful, remove the PFX entries and munch again. This helps. Of course, that does not
work if you munch the full list of words -- you have to rely on the good will of
munch.
Huh? Again, I am sorry, but I can't seem to follow exactly what problem you are having.
Perhaps it will help if I explain how munch and unmunch are meant to be used.
The way to use munch is to take a long list of commonly used but correctly spelled words (call this the language's "working set") and then use the prefixes and suffixes identified in the .aff file to compress that working set into a much shorter list of correctly spelled root words with affix flags (a .dic file). unmunch can then be used on that .dic file and .aff file to recreate the exact same working set of commonly used words, with NO additional words created.
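The core of what munch does can be sketched like this (a greatly simplified toy in Python, using the same invented suffix table as above; real munch handles prefixes, strip conditions, and much more): find roots whose affixed forms are ALL attested in the working set, flag them, and drop the forms the flags now generate.

```python
# Toy suffix table, loosely modeled on SFX entries: flag -> (strip, add) pairs.
SUFFIX_RULES = {"S": [("", "s")], "D": [("", "ed")]}

def munch(working_set):
    """Compress a working set into (root, flags) entries.

    A flag is attached to a root only when every form that flag would
    generate is itself present in the working set, so expanding the
    result recreates exactly the original words and nothing more.
    """
    working = set(working_set)
    covered, entries = set(), []
    for root in sorted(working):
        flags = ""
        for flag, pairs in SUFFIX_RULES.items():
            forms = set()
            for strip, add in pairs:
                base = root[: len(root) - len(strip)] if strip else root
                forms.add(base + add)
            if forms <= working:          # only keep flags whose output is attested
                flags += flag
                covered |= forms
        entries.append((root, flags))
    # Drop words that are now generated by some flagged root.
    return [(w, f) for w, f in entries if w not in covered]

dic = munch(["walk", "walks", "walked", "talk", "talks"])
```

Five input words compress to two entries, `talk/S` and `walk/SD`, and expanding those two entries yields exactly the original five words back.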
Many people with affix-heavy languages try to skip the generation of the "working set" and instead use unmunch and their .aff info to build a working set up. You can do this, BUT then you must remove all bad words from the working set (pass it through another known-good spell checker, or check it manually) and then run munch on the newly shortened working set to create a final, good .dic file.
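That "unmunch, filter, re-munch" cycle reduces to a simple set filter; here is a minimal Python sketch where `known_good` stands in for the output of another trusted spell checker or a manual review (all words are invented for the example):

```python
# unmunch over-generates: the affix rules produce "runed", which is
# not a real word. Filtering against a trusted list removes it before
# the working set is handed back to munch.
unmunched = ["runs", "runed", "run", "jump", "jumps"]
known_good = {"run", "runs", "jump", "jumps"}

working_set = [w for w in unmunched if w in known_good]
```

Only the four attested words survive into `working_set`; the bogus `runed` is dropped.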
Please understand that a "working set" is not the complete set of all words in a language. It is the subset of words that are most commonly found and used (i.e. a spell checker is NOT an unabridged dictionary!). That is why using a word harvester (based on a web-crawler tool or a word-document parser) on documents commonly found at the introductory university level is a good way to identify a sufficient working set for a language.
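The harvesting step is, at heart, just frequency counting; a minimal sketch in Python (the sample texts are invented, and a real harvester would sit behind a crawler or document parser):

```python
import re
from collections import Counter

def harvest(texts, top_n=10):
    """Count word frequencies across plain-text documents and keep
    the top_n most common words as a candidate working set."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]

words = harvest(["The cat sat.", "The cat ran!"], top_n=2)
```

With real input you would raise `top_n` into the tens of thousands and then hand the result off for spell-checking before munching it.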
Hope something here helps,
Kevin
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
