Hi,
Will the end user (of oowriter) notice any difference whether the Myspell/Hunspell dictionary uses affix flags or not? Is there any added value in providing a spelling dictionary with flags, as opposed to a flat list of word forms?
The answer is possibly. It depends on how large the "working set" is for the language, how much memory is available, etc.
The original reason affixes are used is to compress the memory footprint of the working set of the language. Since an affix can be either a prefix or a suffix (or both) and can be represented by simple rules, one entry in the table with the correct affixes can in fact represent many other entries that would have to exist in a flat word file.
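To make the compression idea concrete, here is a minimal sketch in Python. The flag letters, the rule table, and the entry "kind/UY" are assumptions made up for illustration; real .aff rules also carry stripping and condition fields, which are left out here.

```python
# Hypothetical, heavily simplified affix table (not the real en_US.aff).
PREFIXES = {"U": "un"}
SUFFIXES = {"S": "s", "Y": "ly"}


def expand(entry):
    """Return the set of word forms encoded by one 'root/FLAGS' entry."""
    root, _, flags = entry.partition("/")
    # Forms produced by applying each suffix flag to the bare root.
    suffixed = {root + SUFFIXES[f] for f in flags if f in SUFFIXES}
    forms = {root} | suffixed
    for f in flags:
        if f in PREFIXES:
            # Assumption: a prefix combines with the root and with
            # every suffixed form (a simple cross product).
            forms |= {PREFIXES[f] + w for w in {root} | suffixed}
    return forms


print(sorted(expand("kind/UY")))
```

In this sketch one stored entry stands in for four word forms, which is the saving described above; a flat list would need one line per form.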
In fact some languages are so affix heavy that no reasonable memory allocation could keep the entire working set of the language in memory.
Please note, although affixes and affix rules are often used to add plural forms, indicate tense, etc., there is typically not a one-to-one mapping between affixes and plurals, past tense, gender, etc. That said, many languages do attempt to do just that.
en_US does not do that. The affix file behind en_US was actually designed to identify and use only the most common prefixes and suffixes and encode them as rules.
For example, en_US.dic contains "man/USY" where "/U" adds un-, "/S" adds -s, and "/Y" adds -ly. But the plural "men" is not created by flags, but listed separately, as "men/MS". (I'm not sure why "mans" and "mens" are OK, but English is not my native language. "Manly" is covered by man/Y but "manly" is also listed separately, perhaps in order to cover "unmanly".)
Yes, I agree "mens" and "mans" should be "men's" and "man's". But that is not the fault of the affix flags. There is a program called "munch" that takes a long flat file of words and an affix file and creates the (hopefully much smaller) dictionary .dic file. The only rule in the creation of the .dic file is that, when expanded, it cannot produce any word that is not in the original flat word file. There is also an "unmunch" program, which takes a .dic and .aff file and expands them; the result (after sorting uniquely) reproduces exactly the original flat word list, with nothing missing or added.
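The round-trip rule can be sketched as follows. This is not the algorithm of the real munch and unmunch programs, just a greedy toy version in Python under stated assumptions: suffix-only rules, and hypothetical flag letters "S", "D" and "G".

```python
# Hypothetical suffix-only rule table; real affix files are richer.
SUFFIXES = {"S": "s", "D": "ed", "G": "ing"}


def unmunch(entries):
    """Expand 'root/FLAGS' entries back into a flat set of words."""
    words = set()
    for entry in entries:
        root, _, flags = entry.partition("/")
        words.add(root)
        words.update(root + SUFFIXES[f] for f in flags)
    return words


def munch(word_list):
    """Greedily fold a flat word list into 'root/FLAGS' entries."""
    words = set(word_list)
    flags_for = {}
    covered = set()
    # Shorter words first, so that roots are seen before derived forms.
    for root in sorted(words, key=lambda w: (len(w), w)):
        if root in covered:
            continue
        # Only grant a flag when the form it generates is in the list,
        # so expansion can never create a word that was not there.
        flags = "".join(f for f, suffix in sorted(SUFFIXES.items())
                        if root + suffix in words)
        covered.update(root + SUFFIXES[f] for f in flags)
        flags_for[root] = flags
    return [root + "/" + flags if flags else root
            for root, flags in sorted(flags_for.items())]


flat = ["walk", "walks", "walked", "walking", "talk", "talks", "men"]
entries = munch(flat)
print(entries)
print(unmunch(entries) == set(flat))
```

Every word either becomes a root or is generated by a shorter root, and no flag is granted unless its output is already in the list, so expanding the result gives back the original word set.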
For some languages it is impossible to create a flat word list in the first place, so flags are actually assigned by hand to root words to try to "stochastically cover" the language's working set of words.
English does not do this.
Apparently, the English example cannot be used to derive the basic form (singular: man) from the plural (men), since these are listed as two separate entries.
That is not its original purpose. Again, the affix entries were originally just meant to identify and remove common prefixes and suffixes from words, to effectively reduce the size of the memory footprint needed to store the language.
Of the affix files I've looked at (da, de, en, no, sv), only the Danish has comments for each flag. The others look more like object code than source code. Were they converted automatically from older ispell affix files? Is the distribution of the derivate without comments allowed by the licenses (GPL? LGPL?) of the original files? Does anybody maintain .dic files manually, knowing the flags by heart, or are .dic files always generated by the likes of ispell's munchlist program?
See the "munch" and "unmunch" programs that came with "MySpell" and are probably included with Hunspell as well. Also, there are Perl scripts that will convert ispell affix files to Hunspell/MySpell affix files.
As the affix flags often do indicate plurals, genitives and other grammatical patterns of words, it would seem natural to combine this with grammar checking and thesaurus, but no such connection exists with today's ispell/myspell/hunspell.
They can, but they were originally not meant to do that. The English affix file is close, but as your example points out, there is no affix to go from man -> men, while there is one for thing -> things. In fact, with the help of an exception list, you could probably make the en_US.aff file handle most of those cases easily.
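A rough sketch of the exception-list idea, in Python. Nothing like this exists in MySpell or Hunspell as far as this thread goes; the names EXCEPTIONS, ROOTS and base_form are hypothetical, and only a single "-s" suffix rule is modelled.

```python
# Hypothetical exception list for irregular forms the affix rules
# cannot produce, plus a tiny set of known root words.
EXCEPTIONS = {"men": "man", "women": "woman"}
ROOTS = {"man", "woman", "thing"}


def base_form(word):
    """Return the root of a word, or None if it cannot be derived."""
    # Irregular forms are resolved from the exception list first.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in ROOTS:
        return word
    # Regular case: undo the "-s" suffix rule and check the root exists.
    if word.endswith("s") and word[:-1] in ROOTS:
        return word[:-1]
    return None


for w in ("men", "things", "woman", "xyz"):
    print(w, "->", base_form(w))
```

With a mapping like this, a thesaurus or grammar checker could look up "men" under "man", which is the connection missing from the current tools.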
For example, in the current English thesaurus, "man" is an antonym of "woman", but "men" is not an antonym of "women".
That is based on the source of the thesaurus, not the affix file. It could be based on the affix file, but thesaurus and grammar checking applications came much later than the simple spell checking applications, and so they were not designed in concert. That holds true for hyphenation as well.
German "Buch" has synonyms in Lektüre, Titel, Werk, but "Bücher" and "Büchern" have no synonyms. Have any attempts been made to create a unified higher level dictionary format, from which a spelling dictionary, hyphenation, and thesaurus can be generated?
Good idea, but for some languages this might be impossible.
Kevin
