Re: [lingu-dev] Hunspell affixes

Kevin B. Hendricks Sat, 02 Dec 2006 08:08:00 -0800

Hi,

Will the end user (of oowriter) notice any difference whether the
Myspell/Hunspell dictionary uses affix flags or not?  Is there any
added value in providing a spelling dictionary with flags, as
opposed to a flat list of word forms?

The answer is possibly. It depends on how large the "working set" is for the language, how much memory is available, etc.

The original reason affixes are used is to compress the memory footprint of the working set of the language. Since an affix can be either a prefix or a suffix (or both) and can be represented by simple rules, one entry in the table with the correct affixes can in fact represent many, many other entries that would have to exist in a flat word file.
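
To make the compression concrete, here is a minimal Python sketch. The rule table is a toy, not the real en_US.aff (real affix rules also carry conditions and stripping instructions), the flag letters are borrowed from the en_US example quoted further down, and the entry "lock/US" and the function name expand are made up for illustration.

```python
# Toy affix table, NOT the real en_US.aff: real rules also carry
# conditions and stripping instructions, which are omitted here.
PREFIXES = {"U": "un"}             # flag -> prefix
SUFFIXES = {"S": "s", "Y": "ly"}   # flag -> suffix

def expand(entry):
    """Expand one 'root/FLAGS' entry into the set of word forms it stands for."""
    root, _, flags = entry.partition("/")
    # The bare root plus one form per suffix flag.
    stems = [root] + [root + SUFFIXES[f] for f in flags if f in SUFFIXES]
    forms = set(stems)
    # Assumption of this sketch: every prefix may combine with every suffixed form.
    for f in flags:
        if f in PREFIXES:
            forms.update(PREFIXES[f] + s for s in stems)
    return forms

# One stored entry stands for four word forms.
print(sorted(expand("lock/US")))   # ['lock', 'locks', 'unlock', 'unlocks']
```

One line in the table replaces four lines of a flat word file, and the saving grows with the number of flags a root can legitimately carry.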

In fact some languages are so affix heavy that no reasonable memory allocation could keep the entire working set of the language in memory.

Please note, although affixes and affix rules are often used to add plural forms, indicate tense, etc., there is typically not a one-to-one mapping between affixes and plurals, past tense, gender, etc. That said, many languages do attempt to do just that.

en_US does not do that. The affix file behind en_US was actually designed to identify and use only the most common prefixes and suffixes and encode them as rules.

For example, en_US.dic contains "man/USY" where "/U" adds un-,
"/S" adds -s, and "/Y" adds -ly.  But the plural "men" is not
created by flags, but listed separately, as "men/MS".  (I'm not
sure why "mans" and "mens" are OK, but English is not my native
language.  "Manly" is covered by man/Y but "manly" is also listed
separately, perhaps in order to cover "unmanly".)

Yes, I agree mens and mans should be "men's" and "man's" and not "mens" and "mans". But that is not the fault of the affix flags. There is a program called "munch" that takes a long flat file of words and an affix file and then creates the hopefully much smaller dictionary .dic file. The only rule in the creation of the .dic file is that, when expanded, it cannot produce any word that is not in the original flat word file. There is an "unmunch" program which will take a .dic and .aff file, expand them, and (after sorting uniquely) reproduce exactly the original flat word list with nothing missing or added.
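
The round-trip rule can be sketched in a few lines of Python. This is not the algorithm of the real munch and unmunch programs, only a greedy toy under stated assumptions: a three-flag rule table without conditions, prefixes that combine freely with suffixes, and a made-up word list. The names munch, unmunch, expand and FLAG_ORDER are chosen for this sketch.

```python
# Toy rule table (assumption): no conditions, no stripping, unlike a real .aff file.
PREFIXES = {"U": "un"}
SUFFIXES = {"S": "s", "Y": "ly"}
FLAG_ORDER = "SYU"   # order in which flags are tried on a root

def expand(root, flags):
    """All word forms that root/flags stands for in this toy model."""
    stems = [root] + [root + SUFFIXES[f] for f in flags if f in SUFFIXES]
    forms = set(stems)
    for f in flags:
        if f in PREFIXES:
            forms.update(PREFIXES[f] + s for s in stems)
    return forms

def munch(words):
    """Greedily compress a flat word list into 'root/FLAGS' entries."""
    words = set(words)
    covered, entries = set(), []
    for root in sorted(words, key=lambda w: (len(w), w)):   # shortest roots first
        if root in covered:
            continue   # already produced by an earlier entry
        flags = ""
        for f in FLAG_ORDER:
            # Keep a flag only if it creates no word outside the original list.
            if expand(root, flags + f) <= words:
                flags += f
        entries.append(root + "/" + flags if flags else root)
        covered |= expand(root, flags)
    return entries

def unmunch(entries):
    """Expand 'root/FLAGS' entries back into the flat set of words."""
    words = set()
    for entry in entries:
        root, _, flags = entry.partition("/")
        words |= expand(root, flags)
    return words

flat = {"man", "manly", "unmanly", "men", "kind", "kinds",
        "kindly", "unkind", "unkindly"}
dic = munch(flat)
print(dic)                     # ['man/Y', 'men', 'kind/SY', 'unkind/Y', 'unmanly']
print(unmunch(dic) == flat)    # True
```

Nine words shrink to five entries, and expanding them gives back exactly the nine words. Note that "man" does not get the S flag here because "mans" is not in the toy word list.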

Some languages, where it is impossible to create a flat word list in the first place, actually hand-assign flags to root words to try to "stochastically cover" the language's working set of words.

English does not do this.

Apparently, the English example cannot be used to derive the basic
form (singular: man) from the plural (men), since these are listed
as two separate entries.

That is not its original purpose. Again the affix entries were originally just to identify and remove common prefixes and suffixes from words to effectively reduce the size of the memory footprint needed to store the language.


Of the affix files I've looked at (da, de, en, no, sv), only the
Danish has comments for each flag. The others look more like
object code than source code. Were they converted automatically
from older ispell affix files? Is the distribution of the derivate
without comments allowed by the licenses (GPL? LGPL?) of the
original files?  Does anybody maintain .dic files manually,
knowing the flags by heart, or are .dic files always generated by
the likes of ispell's munchlist program?

See the "munch" and "unmunch" programs that came with "MySpell" and are probably included with Hunspell as well. Also there are Perl scripts that will convert ispell to hunspell/myspell affix files.


As the affix flags often do indicate plurals, genitives and other
grammatical patterns of words, it would seem natural to combine
this with grammar checking and thesaurus, but no such connection
exists with today's ispell/myspell/hunspell.

They can, but they were originally not meant to do that. The English affix file is close, but as your example points out, there is no affix to go from man -> men, while there is one for thing -> things. In fact, with the help of an exception list, you could probably make the en_US.aff file handle most of those cases easily.
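
A rough Python sketch of the exception-list idea follows. Everything in it is hypothetical: the dictionary content, the EXCEPTIONS table and the function name singular are made up for illustration, and Hunspell itself offers nothing of the kind in this form.

```python
# Toy dictionary: root -> flags. "S" means the root takes a plural -s (assumption).
DIC = {"thing": "S", "man": "", "woman": ""}

# Hand-maintained exception list for plurals that no affix rule can produce.
EXCEPTIONS = {"men": "man", "women": "woman"}

def singular(word):
    """Return the base form of word, or None if the word is unknown."""
    if word in EXCEPTIONS:          # irregular plural: look it up directly
        return EXCEPTIONS[word]
    if word.endswith("s") and "S" in DIC.get(word[:-1], ""):
        return word[:-1]            # regular plural: undo the S suffix rule
    if word in DIC:                 # already a base form
        return word
    return None

print(singular("things"))   # thing
print(singular("men"))      # man
```

The affix rule handles the regular case (things -> thing) and the exception list handles the irregular one (men -> man), so a thesaurus lookup could be keyed on the base form in both cases.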

For example, in the
current English thesaurus, "man" is an antonym of "woman", but
"men" is not an antonym of "women".

That is based on the source of the thesaurus, not the affix file. It could be based on the affix file but Thesaurus and Grammar checking applications came much much later than the simple spell checking applications and so they were not designed in concert. That holds true for hyphenation as well.

German "Buch" has synonyms in
Lektüre, Titel, Werk, but "Bücher" and "Büchern" have no synonyms.
Have any attempts been made to create a unified higher level
dictionary format, from which a spelling dictionary, hyphenation,
and thesaurus can be generated?


Good idea but for some languages this might be impossible.

Kevin

