Hi,

Is there any way or command to get a list of all valid words in a Hunspell dictionary, both the ones in the dictionary file and the ones generated by affix rules? Secondly, is there any way to let Hunspell know that the same combined character, written in different ways, is the same? For example, the character "ọ́" can be written by first typing "o" and then adding an under-dot and a tone mark, or by typing "ọ" and adding a tone mark, or by typing "ó" and adding an under-dot.
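Two hedged pointers, not from the thread itself: for listing all valid words, the Hunspell source tree ships an unmunch tool (in src/tools) that expands a dic/aff pair into the generated word list. For the combining-character question, here is a sketch assuming a Hunspell version that supports the ICONV input-conversion table, which rewrites input character sequences before lookup (the exact entries would need checking against the compositions your typists actually produce):

```
SET UTF-8

# Hypothetical ICONV table (assumes ICONV support in your Hunspell
# version).  It folds alternative compositions of "o + dot below +
# acute" into the canonical sequence U+1ECD U+0301:
#   entry 1 input: U+00F3 U+0323        (ó, then combining dot below)
#   entry 2 input: U+006F U+0323 U+0301 (o, dot below, then acute)
ICONV 2
ICONV ọ́ ọ́
ICONV ọ́ ọ́
```

An alternative is to run the text through Unicode NFC normalization before spellchecking, so that all compositions reach Hunspell in one canonical form.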
Regards,
Jeje

--- On Wed, 2/25/09, Németh László <[email protected]> wrote:

From: Németh László <[email protected]>
Subject: [lingu-dev] Automatic dictionary development and optimization (Re: Questions regarding Hunspell format for new Norwegian dictionaries)
To: [email protected], "Karl Ove Hufthammer" <[email protected]>
Date: Wednesday, February 25, 2009, 7:12 AM

Hi,

2009/2/6 Karl Ove Hufthammer <[email protected]>:
> ... could (should) I just use affixcompress on the words in the third column
> to generate a dictionary file. It seems to work well. Or is there a way to use
> the information on each word to *automatically* improve the suggestions (I
> will of course also add suggestion hints in the affix file manually), or reduce
> the dictionary size, or improve the speed for lookups and suggestions?

The affixcompress script is for compressing a huge word list, searching for potential stems and affixes. The most important results of the affix compression are a smaller memory footprint and shorter loading time. (In fact, affix-rich languages need affix compression.) Compressed dictionaries may have slower lookups for the compressed words. The time-consuming dictionary-based (n-gram and phonetic) suggestion is much faster with smaller dic files (suggestion speed is the bottleneck during normal usage of the spelling dictionaries).

Example: generating a compressed dictionary from the standard English dictionary (/usr/share/dict/words) of Linux:

$ LC_ALL=C sort /usr/share/dict/words >en
$ hunspell-1.2.8/src/tools/affixcompress en 1000

The compressed dictionary contains only 30 thousand words in the file en.dic, instead of the 99 thousand words of the original word list. The file en.aff contains the 1000 predefined affixes (but it misses some of the default settings: SET character encoding, TRY definition, etc.).
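To make the compression concrete, here is a toy illustration (this is not the affixcompress script itself; the file names, flag letter and rules are invented). It expands a compressed stem-plus-flags entry back into full words using zero-stripping SFX rules, roughly the way Hunspell does at lookup time:

```shell
# Miniature affix file: flag 'A' adds the suffixes "s" and "ed".
cat > toy.aff <<'EOF'
SFX A Y 2
SFX A 0 s .
SFX A 0 ed .
EOF

# Miniature dic file: first line is the word count, then stem/flags.
cat > toy.dic <<'EOF'
2
walk/A
jump/A
EOF

# Expand every stem with every zero-stripping suffix of its flag.
# (Real Hunspell also handles stripping, conditions and prefixes.)
forms=$(awk 'NR==FNR { if ($1 == "SFX" && $3 == "0") suf[++n] = $4; next }
             FNR > 1 { split($0, a, "/"); print a[1]
                       if (a[2] != "")
                           for (i = 1; i <= n; i++) print a[1] suf[i] }' \
            toy.aff toy.dic)
printf '%s\n' "$forms"
# -> walk, walks, walked, jump, jumps, jumped (one per line)
```

Two stems plus one shared rule set stand in for six words; at the scale of /usr/share/dict/words, this is why the compressed en.dic shrinks from 99 thousand entries to 30 thousand.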
Alias compression is an optimization method for the dictionaries of affix-rich (agglutinative) languages, but it also reduces memory usage and improves affix analysis:

$ hunspell-1.2.8/src/tools/makealias en.dic en.aff
output: en_alias.dic, en_alias.aff

Memory usage (RSS and VSZ fields are in kB):

$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en0 &
$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en &
$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en_alias &
$ ps -eo pid,ppid,rss,vsize,pcpu,pmem,cmd -ww --sort=pid | sed -n '1p;/lt-hunspell/p'
  PID  PPID  RSS   VSZ %CPU %MEM CMD
 6767  6314 5564  8956  0.1  0.2 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en0
 6768  6314 3444  6776  0.0  0.1 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en
 6769  6314 3160  6492  0.1  0.1 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en_alias

(The "en0" dictionary is the original word list without affix compression:

$ cp /usr/share/dict/words en0.dic
$ touch en0.aff)

Hunspell 1.2.x also has a ZIP-like compression format for dictionaries:

$ hunspell-1.2.8/src/tools/hzip en.* en_alias.*
$ ls -lh en.* en_alias.*
-rw-r--r-- 1 laci laci  29K 2009-02-19 15:16 en.aff
-rw-r--r-- 1 laci laci 9,5K 2009-02-19 16:04 en.aff.hz
-rw-r--r-- 1 laci laci 425K 2009-02-19 15:16 en.dic
-rw-r--r-- 1 laci laci 197K 2009-02-19 16:04 en.dic.hz
-rw-r--r-- 1 laci laci 166K 2009-02-19 15:36 en_alias.aff
-rw-r--r-- 1 laci laci  74K 2009-02-19 16:04 en_alias.aff.hz
-rw-r--r-- 1 laci laci 311K 2009-02-19 15:36 en_alias.dic
-rw-r--r-- 1 laci laci 131K 2009-02-19 16:04 en_alias.dic.hz

The hzip-compressed en_alias dictionary needs 205 kB of disk space. (The size of the original Linux English word list was 910 kB.)

Measuring suggestion speed (the affix files of the dictionaries were extended with the following header):

TRY qwertzuiopasdfghjklyxcvbnm'-
WORDCHARS '-

Generate 9-character random misspelled words from the Linux word list:

$ sed -n '1~500p' /usr/share/dict/words | tr -d '\n' | grep -o '.........' >bad.txt
$ wc -l <bad.txt
184
$ tail bad.txt
'sunderta
kersunoff
iciallyup
rootedvea
ledvindic
ationwage
redwavere
dwhiniest
wintrywra
pzipper's

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en0
...
& dwhiniest 6 0: whiniest, d whiniest, shiniest, whinniest, grainiest, brainiest
& wintrywra 4 0: wintry, wintrier, wintriest, wintery
& pzipper's 6 0: zipper's, p zipper's, clipper's, shipper's, skipper's, slipper's
13.75user 0.01system 0:13.79elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4945minor)pagefaults 0swaps

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en
...
& dwhiniest 6 0: whiniest, d whiniest, whinniest, whinnies, Dniester, daintiest
& wintrywra 4 0: wintrier, wintriest, wintergreen, Winthrop
& pzipper's 6 0: zipper's, p zipper's, slipper's, skipper's, Dipper's, kipper's
4.20user 0.01system 0:04.22elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4374minor)pagefaults 0swaps

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en_alias
& dwhiniest 6 0: whiniest, d whiniest, whinniest, whinnies, Dniester, daintiest
& wintrywra 4 0: wintrier, wintriest, wintergreen, Winthrop
& pzipper's 6 0: zipper's, p zipper's, slipper's, skipper's, Dipper's, kipper's
4.17user 0.01system 0:04.18elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4297minor)pagefaults 0swaps

So suggestion time depends strongly on the word count of the dic file.

Languages with complex morphology can use the second-level affixation of Hunspell. There is a new tool, doubleaffixcompress (http://downloads.sourceforge.net/hunspell/doubleaffixcompress), to compress the output dictionary of the affixcompress script, or other Hunspell dictionaries, using second-level affixes.
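As a sketch of what second-level (twofold) affixation means, with invented numeric flags: a suffix rule can attach a further flag to the form it produces, and Hunspell strips up to two suffixes by default. A toy pair of files might look like this:

```
# toy.aff (numeric flags are invented for illustration)
FLAG num
SFX 101 Y 1
# flag 101 derives the -er form; the derived form carries flag 102
SFX 101 0 er/102 .
SFX 102 Y 1
# flag 102 pluralizes the derived form
SFX 102 0 s .

# toy.dic
2
abolish/101
wash/101
```

If this reading is right, the single entry abolish/101 covers abolish, abolisher and abolishers, which is the mechanism doubleaffixcompress exploits to halve the entry count.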
For example, on the old en_US dictionary of OpenOffice.org we got a 50% compression rate:

$ doubleaffixcompress en_US
$ wc -l en_US.dic new_en_US.dic
  62157 en_US.dic
  30442 new_en_US.dic
$ grep abolish en_US.dic
abolisher/M
abolish/LZRSDG
abolishment/MS
$ grep abolish new_en_US.dic
abolish/5193,6535,64991,64993,64995,64996,64997,65001
$ grep '\(5193\|6535\)' new_en_US.aff
SFX 5193 Y 1
SFX 5193 0 er/64999 .
SFX 6535 Y 1
SFX 6535 0 ment/64997,64999 .

An even more important result is on the (too big) he_IL dictionary. (This dictionary recognizes more than 100 million Hebrew word forms.)

$ LC_ALL=C doubleaffixcompress he_IL
$ wc he_IL.dic new_he_IL.dic
 329237  328996 3212113 he_IL.dic
  37913   37879 1940612 new_he_IL.dic
$ LC_ALL=C ~/hunspell-1.2.8/src/tools/makealias new_he_IL.{dic,aff}
output: new_he_IL_alias.dic, new_he_IL_alias.aff

Memory usage has been reduced from 19 MB to 5.5 MB by doubleaffixcompress and makealias.

2009/2/6 Karl Ove Hufthammer <[email protected]>:
> Hi!
>
> I couldn't find a mailing list for questions regarding Hunspell, so I'm writing
> to you. Please feel free to direct me to the relevant mailing list or forum
> instead of answering me directly.

I will post your letter to the Lingucomponent development list of OpenOffice.org with a detailed example.

> I am about to create a new spellchecker for the Norwegian Nynorsk language
> (and possibly Norwegian Bokmål too), based on Hunspell. However, I have some
> questions on how to best proceed.
>
> We are lucky enough to have access to a (GPL 3+-based) fullform dictionary for
> Norwegian, which most other languages using Hunspell don't seem to have.
> But I'm not sure how to best make use of the information in this database.
> Here is an example output, for the word «hoppe»:
>
> 37933 hoppe hoppe subst fem appell eint ub
> 37933 hoppe hoppa subst fem appell eint ub
> 37933 hoppe hoppa subst fem appell eint bu
> 37933 hoppe hopper subst fem appell fl ub
> 37933 hoppe hoppor subst fem appell fl ub
> 37933 hoppe hoppene subst fem appell fl bu
> 37933 hoppe hoppone subst fem appell fl bu
> 37934 hoppe hoppe verb inf
> 37934 hoppe hoppa verb inf
> 37934 hoppe hoppar verb pres
> 37934 hoppe hoppast verb inf pres st-form
> 37934 hoppe hoppa verb pret
> 37934 hoppe hoppa verb perf-part
> 37934 hoppe hoppa adj <perf-part> nøyt ub eint
> 37934 hoppe hoppa adj <perf-part> m/f ub eint
> 37934 hoppe hoppa adj <perf-part> bu eint
> 37934 hoppe hoppa adj <perf-part> fl
> 37934 hoppe hoppande adj <pres-part>
> 37934 hoppe hopp verb imp
> 37934 hoppe hoppe verb imp
> 37934 hoppe hoppa verb imp
>
> (Here the code «subst» means noun. And yes, we *do* have words with more
> irregular inflection in Norwegian too. :) )
>
> As indicated by the numeric code, there are actually two root words «hoppe».
> One (37933) is a noun, meaning mare (female horse), and the other (37934) is a
> verb, meaning «jump». The adjective (code «adj») is derived from the
> verb, and therefore has the same code as it. «fem» is the gender, «eint» means
> singular, and «ub» and «bu» mean indefinite and definite form, respectively.
>
> Is this information of any use when generating the dictionary file, and how can
> I use it? From what I've read about Hunspell, the main part of the affix file is
> only used as a way to compress the dictionary, and doesn't have any effect on
> which words are suggested by Hunspell.
>
> If so, could (should) I just use affixcompress on the words in the third column
> to generate a dictionary file? It seems to work well.
> Or is there a way to use
> the information on each word to *automatically* improve the suggestions (I
> will of course also add suggestion hints in the affix file manually), or reduce
> the dictionary size, or improve the speed for lookups and suggestions?

Automatic compression is perfect for a spelling dictionary, but the upcoming thesaurus extension needs real data for stemming, and needs extra information for morphological generation. The automatic dictionary compression has a drawback for stemming: possible artificial morphology:

$ hunspell -d en
windows
+ wind

(This is not too good for the dictionary-based suggestion either.) Future versions of affixcompress will be able to use word-frequency data to correct the stem analysis.

Your dictionary development needs a new script to keep the real stems (you can add irregular forms to the dic file: "mice st:mouse", see http://www.openoffice.org/issues/show_bug.cgi?id=19563) and to encode the morphological information in the dictionary. When you need this development for the Norwegian thesaurus, I will help you.

Thanks for your questions.

Regards,
László

> Thanks in advance for your reply.
>
> --
> Regards,
> Karl Ove Hufthammer

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
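As a sketch of the "mice st:mouse" idea from the message above (the /S flag and the entries other than mice are invented for illustration), a dic fragment with st: morphological fields could look like this:

```
# fragment of a .dic file: the st: field attaches the real stem
# to an irregular form, so stemming does not invent an artificial one
3
mouse/S
mice st:mouse
house/S
```

With such an entry, Hunspell's stemming output (hunspell -s) should report mouse as the stem of mice rather than an artificial analysis; see the issue link above for the details of the st: field.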
