Hi,

Will the end user (of oowriter) notice any difference whether the
Myspell/Hunspell dictionary uses affix flags or not?  Is there any
added value in providing a spelling dictionary with flags, as
opposed to a flat list of word forms?

The answer is: possibly. It depends on how large the "working set" is for the language, how much memory is available, and so on.

The original reason affixes are used is to compress the memory footprint of the working set of the language. Since a root can take a prefix, a suffix, or both, and each affix can be represented by a simple rule, one entry in the table with the correct affix flags can in fact represent many other entries that would otherwise have to exist in a flat word file.
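To make the compression idea concrete, here is a minimal sketch in Python. The affix table is a hypothetical simplification based on the three en_US flags discussed in this thread; real .aff rules also carry strip strings and conditions, and may combine prefixes with suffixes, none of which is modelled here.

```python
# Hypothetical, simplified affix table: flag -> (kind, text).
# Real .aff rules also have strip strings and match conditions.
AFFIXES = {
    "U": ("prefix", "un"),
    "S": ("suffix", "s"),
    "Y": ("suffix", "ly"),
}

def expand(entry):
    """Expand one 'root/FLAGS' entry into the set of word forms it stands for.

    Simplification: each flag is applied to the bare root only; prefix and
    suffix are never combined.
    """
    root, _, flags = entry.partition("/")
    forms = {root}
    for flag in flags:
        kind, text = AFFIXES[flag]
        forms.add(text + root if kind == "prefix" else root + text)
    return forms

# One line in the .dic file stands for four lines of a flat word list.
print(sorted(expand("man/USY")))
```

The more flags a root carries, the more flat-list lines a single entry replaces, which is where the memory saving comes from.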

In fact, some languages are so affix-heavy that no reasonable memory allocation could keep the entire working set of the language in memory.

Please note that although affixes and affix rules are often used to add plural forms, indicate tense, etc., there is typically not a one-to-one mapping between affixes and plurals, past tense, gender, etc. That said, many languages do attempt to do just that.

en_US does not do that. The affix file behind en_US was actually designed to identify and use only the most common prefixes and suffixes and encode them as rules.

For example, en_US.dic contains "man/USY" where "/U" adds un-,
"/S" adds -s, and "/Y" adds -ly.  But the plural "men" is not
created by flags, but listed separately, as "men/MS".  (I'm not
sure why "mans" and "mens" are OK, but English is not my native
language.  "Manly" is covered by man/Y but "manly" is also listed
separately, perhaps in order to cover "unmanly".)

Yes, I agree that "mens" and "mans" should be "men's" and "man's". But that is not the fault of the affix flags. There is a program called "munch" that takes a long flat file of words and an affix file and creates the (hopefully much smaller) .dic dictionary file. The only rule in creating the .dic file is that, when expanded, it must not produce any word that is not in the original flat word file. There is also an "unmunch" program that takes a .dic and an .aff file and expands them; after sorting and removing duplicates, the result reproduces the original flat word list exactly, with nothing missing or added.
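The round-trip rule can be sketched as follows. This is not the algorithm of the real "munch" and "unmunch" programs (those are separate tools that read actual .aff files); it is a toy Python version with two hypothetical suffix-only rules, meant only to show the invariant that expanding the compressed list gives back exactly the flat list.

```python
# Hypothetical, simplified suffix rules: flag -> suffix text.
SUFFIXES = {"S": "s", "Y": "ly"}

def munch(words):
    """Compress a flat word list into 'root/FLAGS' entries.

    A flag is assigned to a root only if the form it generates is already
    in the flat list, so expansion can never invent a new word.
    """
    words = set(words)
    flags = {w: "".join(f for f, s in SUFFIXES.items() if w + s in words)
             for w in words}
    covered = {w + SUFFIXES[f] for w, fl in flags.items() for f in fl}
    # Keep a word if it carries flags itself or no other root generates it.
    return sorted(w + ("/" + flags[w] if flags[w] else "")
                  for w in words if flags[w] or w not in covered)

def unmunch(dic):
    """Expand 'root/FLAGS' entries back into the set of word forms."""
    out = set()
    for entry in dic:
        root, _, fl = entry.partition("/")
        out.add(root)
        out.update(root + SUFFIXES[f] for f in fl)
    return out

flat = ["man", "mans", "manly", "men", "mens", "thing", "things"]
dic = munch(flat)
print(dic)                         # three entries instead of seven
print(unmunch(dic) == set(flat))   # round trip adds and loses nothing
```

Note that "mans" and "mens" survive the round trip simply because they were in the flat list to begin with; the flags only reproduce what they were given.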

For some languages it is impossible to create a flat word list in the first place, so flags are assigned to root words by hand to try to "stochastically cover" the language's working set of words.
English does not do this.

Apparently, the English example cannot be used to derive the basic
form (singular: man) from the plural (men), since these are listed
as two separate entries.

That is not its original purpose. Again, the affix entries were originally just meant to identify and remove common prefixes and suffixes from words, reducing the memory footprint needed to store the language.


Of the affix files I've looked at (da, de, en, no, sv), only the
Danish has comments for each flag. The others look more like
object code than source code. Were they converted automatically
from older ispell affix files? Is the distribution of the derivate
without comments allowed by the licenses (GPL? LGPL?) of the
original files?  Does anybody maintain .dic files manually,
knowing the flags by heart, or are .dic files always generated by
the likes of ispell's munchlist program?

See the "munch" and "unmunch" programs that came with "MySpell" and are probably included with Hunspell as well. There are also Perl scripts that will convert ispell affix files to hunspell/myspell affix files.


As the affix flags often do indicate plurals, genitives and other
grammatical patterns of words, it would seem natural to combine
this with grammar checking and thesaurus, but no such connection
exists with today's ispell/myspell/hunspell.

They can, but they were not originally meant to do that. The English affix file comes close but, as your example points out, there is no affix rule to go from man -> men, although there is one for thing -> things. In fact, with the help of an exception list, you could probably make the en_US.aff file handle most of those cases easily.
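As a rough illustration of the exception-list idea, here is a Python sketch. The root set, the exception list, and the single "-s" rule are all hypothetical stand-ins; nothing like this exists in en_US.aff today, and a real version would need the full rule set.

```python
# Hypothetical mini dictionary of root words.
KNOWN_ROOTS = {"man", "woman", "thing"}
# Hypothetical exception list for forms no affix rule can derive.
IRREGULAR = {"men": "man", "women": "woman"}

def base_form(word):
    """Return the base form of a word, or None if it cannot be derived."""
    if word in KNOWN_ROOTS:
        return word
    if word in IRREGULAR:            # exception list is checked first
        return IRREGULAR[word]
    if word.endswith("s") and word[:-1] in KNOWN_ROOTS:
        return word[:-1]             # regular case: strip the -s suffix
    return None

print(base_form("things"))   # handled by the suffix rule
print(base_form("men"))      # handled by the exception list
```

The regular rule covers thing -> things, and the exception list only has to carry the relatively few forms that the rules cannot reach.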

For example, in the
current English thesaurus, "man" is an antonym of "woman", but
"men" is not an antonym of "women".

That is based on the source of the thesaurus, not the affix file. It could be based on the affix file, but thesaurus and grammar-checking applications came much later than the simple spell-checking applications, so they were not designed in concert. The same holds true for hyphenation.
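If the two were designed together, a thesaurus lookup could fall back to the base form and re-inflect the result. The Python sketch below assumes a hypothetical thesaurus keyed on base forms and a hypothetical plural table; neither reflects how the current thesaurus files are actually built.

```python
# Hypothetical thesaurus keyed on base forms only.
ANTONYMS = {"man": ["woman"]}
# Hypothetical plural table; a real system might derive this from affix data.
PLURALS = {"man": "men", "woman": "women"}
SINGULARS = {plural: singular for singular, plural in PLURALS.items()}

def antonyms(word):
    """Look up antonyms, falling back to the singular for plural forms."""
    if word in ANTONYMS:
        return list(ANTONYMS[word])
    singular = SINGULARS.get(word)
    if singular in ANTONYMS:
        # Re-inflect each antonym so the answer matches the query's number.
        return [PLURALS.get(a, a + "s") for a in ANTONYMS[singular]]
    return []

print(antonyms("man"))
print(antonyms("men"))
```

With such a link in place, "men" would inherit the antonym of "man" without the thesaurus having to list every inflected form.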


German "Buch" has synonyms in
Lektüre, Titel, Werk, but "Bücher" and "Büchern" has no synonym.
Has any attempts been made to create a unified higher level
dictionary format, from which a spelling dictionary, hyphenation,
and thesaurus can be generated?

Good idea, but for some languages this might be impossible.

Kevin

