Hi,

Is there any way or command to get a list of all valid words in a Hunspell dictionary, both the ones in the dictionary file and the ones generated by affix rules? Secondly, is there any way to let Hunspell know that the same combined character, written in different ways, is the same? For example, the character "ọ́" can be written by first typing "o" and then adding an under-dot and a tone mark, or by typing "ọ" and adding a tone mark, or by typing "ó" and adding an under-dot.
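Two hedged pointers, not from the thread itself: for listing all valid words, the Hunspell source tree ships an unmunch tool (in src/tools) that expands a dic/aff pair into the generated word list. For the combining-character question, here is a sketch assuming a Hunspell version that supports the ICONV input-conversion table, which rewrites input character sequences before lookup (the exact entries would need checking against the compositions your typists actually produce):

```
SET UTF-8

# Hypothetical ICONV table (assumes ICONV support in your Hunspell
# version).  It folds alternative compositions of "o + dot below +
# acute" into the canonical sequence U+1ECD U+0301:
#   entry 1 input: U+00F3 U+0323        (ó, then combining dot below)
#   entry 2 input: U+006F U+0323 U+0301 (o, dot below, then acute)
ICONV 2
ICONV ọ́ ọ́
ICONV ọ́ ọ́
```

An alternative is to run the text through Unicode NFC normalization before spellchecking, so that all compositions reach Hunspell in one canonical form.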
Regards,
Jeje

--- On Wed, 2/25/09, Németh László <[email protected]> wrote:

From: Németh László <[email protected]>
Subject: [lingu-dev] Automatic dictionary development and optimization (Re: Questions regarding Hunspell format for new Norwegian dictionaries)
To: [email protected], "Karl Ove Hufthammer" <[email protected]>
Date: Wednesday, February 25, 2009, 7:12 AM

Hi,

2009/2/6 Karl Ove Hufthammer <[email protected]>:
> ... could (should) I just use affixcompress on the words in the third column
> to generate a dictionary file. It seems to work well. Or is there a way to use
> the information on each word to *automatically* improve the suggestions (I
> will of course also add suggestion hints in the affix file manually), or reduce
> the dictionary size, or improve the speed for lookups and suggestions?

The affixcompress script is for compressing a huge word list, searching for potential stems and affixes. The most important results of the affix compression are a smaller memory footprint and shorter loading time. (In fact, affix-rich languages need affix compression.) Compressed dictionaries may have slower lookups for the compressed words. The time-consuming dictionary-based (n-gram and phonetic) suggestion is much faster with smaller dic files (suggestion speed is the bottleneck during normal usage of the spelling dictionaries).

Example: generating a compressed dictionary from the standard English dictionary (/usr/share/dict/words) of Linux:

$ LC_ALL=C sort /usr/share/dict/words >en
$ hunspell-1.2.8/src/tools/affixcompress en 1000

The compressed dictionary contains only 30 thousand words in the file en.dic, instead of the 99 thousand words of the original word list. The file en.aff contains the 1000 predefined affixes (but it misses some of the default settings: SET character encoding, TRY definition, etc.).
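To make the compression concrete, here is a toy illustration (this is not the affixcompress script itself; the file names, flag letter and rules are invented). It expands a compressed stem-plus-flags entry back into full words using zero-stripping SFX rules, roughly the way Hunspell does at lookup time:

```shell
# Miniature affix file: flag 'A' adds the suffixes "s" and "ed".
cat > toy.aff <<'EOF'
SFX A Y 2
SFX A 0 s .
SFX A 0 ed .
EOF

# Miniature dic file: first line is the word count, then stem/flags.
cat > toy.dic <<'EOF'
2
walk/A
jump/A
EOF

# Expand every stem with every zero-stripping suffix of its flag.
# (Real Hunspell also handles stripping, conditions and prefixes.)
forms=$(awk 'NR==FNR { if ($1 == "SFX" && $3 == "0") suf[++n] = $4; next }
             FNR > 1 { split($0, a, "/"); print a[1]
                       if (a[2] != "")
                           for (i = 1; i <= n; i++) print a[1] suf[i] }' \
            toy.aff toy.dic)
printf '%s\n' "$forms"
# -> walk, walks, walked, jump, jumps, jumped (one per line)
```

Two stems plus one shared rule set stand in for six words; at the scale of /usr/share/dict/words, this is why the compressed en.dic shrinks from 99 thousand entries to 30 thousand.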
Alias compression is an optimization method for the dictionaries of affix-rich (agglutinative) languages, but it also reduces memory usage and improves affix analysis:

$ hunspell-1.2.8/src/tools/makealias en.dic en.aff
output: en_alias.dic, en_alias.aff

Memory usage (RSS and VSZ fields are in kB):

$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en0 &
$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en &
$ hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en_alias &
$ ps -eo pid,ppid,rss,vsize,pcpu,pmem,cmd -ww --sort=pid | sed -n '1p;/lt-hunspell/p'
  PID  PPID  RSS   VSZ %CPU %MEM CMD
 6767  6314 5564  8956  0.1  0.2 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en0
 6768  6314 3444  6776  0.0  0.1 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en
 6769  6314 3160  6492  0.1  0.1 hunspell-1.2.8/src/tools/.libs/lt-hunspell -d en_alias

(The "en0" dictionary is the original word list without affix compression:

$ cp /usr/share/dict/words en0.dic
$ touch en0.aff)

Hunspell 1.2.x also has a ZIP-like compression format for dictionaries:

$ hunspell-1.2.8/src/tools/hzip en.* en_alias.*
$ ls -lh en.* en_alias.*
-rw-r--r-- 1 laci laci  29K 2009-02-19 15:16 en.aff
-rw-r--r-- 1 laci laci 9,5K 2009-02-19 16:04 en.aff.hz
-rw-r--r-- 1 laci laci 425K 2009-02-19 15:16 en.dic
-rw-r--r-- 1 laci laci 197K 2009-02-19 16:04 en.dic.hz
-rw-r--r-- 1 laci laci 166K 2009-02-19 15:36 en_alias.aff
-rw-r--r-- 1 laci laci  74K 2009-02-19 16:04 en_alias.aff.hz
-rw-r--r-- 1 laci laci 311K 2009-02-19 15:36 en_alias.dic
-rw-r--r-- 1 laci laci 131K 2009-02-19 16:04 en_alias.dic.hz

The hzip-compressed en_alias dictionary needs 205 kB of disk space. (The size of the original Linux English word list was 910 kB.)

Measuring suggestion speed (the affix files of the dictionaries were extended with the following header):

TRY qwertzuiopasdfghjklyxcvbnm'-
WORDCHARS '-

Generate 9-character random misspelled words from the Linux word list:

$ sed -n '1~500p' /usr/share/dict/words | tr -d '\n' | grep -o '.........' >bad.txt
$ wc -l <bad.txt
184
$ tail bad.txt
'sunderta
kersunoff
iciallyup
rootedvea
ledvindic
ationwage
redwavere
dwhiniest
wintrywra
pzipper's

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en0
...
& dwhiniest 6 0: whiniest, d whiniest, shiniest, whinniest, grainiest, brainiest
& wintrywra 4 0: wintry, wintrier, wintriest, wintery
& pzipper's 6 0: zipper's, p zipper's, clipper's, shipper's, skipper's, slipper's
13.75user 0.01system 0:13.79elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4945minor)pagefaults 0swaps

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en
...
& dwhiniest 6 0: whiniest, d whiniest, whinniest, whinnies, Dniester, daintiest
& wintrywra 4 0: wintrier, wintriest, wintergreen, Winthrop
& pzipper's 6 0: zipper's, p zipper's, slipper's, skipper's, Dipper's, kipper's
4.20user 0.01system 0:04.22elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4374minor)pagefaults 0swaps

$ cat bad.txt | time hunspell-1.2.8/src/tools/hunspell -d en_alias
& dwhiniest 6 0: whiniest, d whiniest, whinniest, whinnies, Dniester, daintiest
& wintrywra 4 0: wintrier, wintriest, wintergreen, Winthrop
& pzipper's 6 0: zipper's, p zipper's, slipper's, skipper's, Dipper's, kipper's
4.17user 0.01system 0:04.18elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4297minor)pagefaults 0swaps

So suggestion time depends strongly on the word count of the dic file.

Languages with complex morphology can use the second-level affixation of Hunspell. There is a new tool, doubleaffixcompress (http://downloads.sourceforge.net/hunspell/doubleaffixcompress), to compress the output dictionary of the affixcompress script, or other Hunspell dictionaries, using second-level affixes.
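As a sketch of what second-level (twofold) affixation means, with invented numeric flags: a suffix rule can attach a further flag to the form it produces, and Hunspell strips up to two suffixes by default. A toy pair of files might look like this:

```
# toy.aff (numeric flags are invented for illustration)
FLAG num
SFX 101 Y 1
# flag 101 derives the -er form; the derived form carries flag 102
SFX 101 0 er/102 .
SFX 102 Y 1
# flag 102 pluralizes the derived form
SFX 102 0 s .

# toy.dic
2
abolish/101
wash/101
```

If this reading is right, the single entry abolish/101 covers abolish, abolisher and abolishers, which is the mechanism doubleaffixcompress exploits to halve the entry count.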
For example, on the old en_US dictionary of OpenOffice.org we got a 50% compression rate:

$ doubleaffixcompress en_US
$ wc -l en_US.dic new_en_US.dic
  62157 en_US.dic
  30442 new_en_US.dic
$ grep abolish en_US.dic
abolisher/M
abolish/LZRSDG
abolishment/MS
$ grep abolish new_en_US.dic
abolish/5193,6535,64991,64993,64995,64996,64997,65001
$ grep '\(5193\|6535\)' new_en_US.aff
SFX 5193 Y 1
SFX 5193 0 er/64999 .
SFX 6535 Y 1
SFX 6535 0 ment/64997,64999 .

An even more important result is on the (too big) he_IL dictionary. (This dictionary recognizes more than 100 million Hebrew word forms.)

$ LC_ALL=C doubleaffixcompress he_IL
$ wc he_IL.dic new_he_IL.dic
 329237  328996 3212113 he_IL.dic
  37913   37879 1940612 new_he_IL.dic
$ LC_ALL=C ~/hunspell-1.2.8/src/tools/makealias new_he_IL.{dic,aff}
output: new_he_IL_alias.dic, new_he_IL_alias.aff

Memory usage has been reduced from 19 MB to 5.5 MB by doubleaffixcompress and makealias.

2009/2/6 Karl Ove Hufthammer <[email protected]>:
> Hi!
>
> I couldn't find a mailing list for questions regarding Hunspell, so I'm writing
> to you. Please feel free to direct me to the relevant mailing list or forum
> instead of answering me directly.

I will post your letter to the Lingucomponent development list of OpenOffice.org with a detailed example.

> I am about to create a new spellchecker for the Norwegian Nynorsk language
> (and possibly Norwegian Bokmål too), based on Hunspell. However, I have some
> questions on how to best proceed.
>
> We are lucky enough to have access to a (GPL 3+-based) fullform dictionary for
> Norwegian, which most other languages using Hunspell don't seem to have.
> But I'm not sure how to best make use of the information in this database.
> Here is an example output, for the word «hoppe»:
>
> 37933 hoppe hoppe subst fem appell eint ub
> 37933 hoppe hoppa subst fem appell eint ub
> 37933 hoppe hoppa subst fem appell eint bu
> 37933 hoppe hopper subst fem appell fl ub
> 37933 hoppe hoppor subst fem appell fl ub
> 37933 hoppe hoppene subst fem appell fl bu
> 37933 hoppe hoppone subst fem appell fl bu
> 37934 hoppe hoppe verb inf
> 37934 hoppe hoppa verb inf
> 37934 hoppe hoppar verb pres
> 37934 hoppe hoppast verb inf pres st-form
> 37934 hoppe hoppa verb pret
> 37934 hoppe hoppa verb perf-part
> 37934 hoppe hoppa adj <perf-part> nøyt ub eint
> 37934 hoppe hoppa adj <perf-part> m/f ub eint
> 37934 hoppe hoppa adj <perf-part> bu eint
> 37934 hoppe hoppa adj <perf-part> fl
> 37934 hoppe hoppande adj <pres-part>
> 37934 hoppe hopp verb imp
> 37934 hoppe hoppe verb imp
> 37934 hoppe hoppa verb imp
>
> (Here the code «subst» means noun. And yes, we *do* have words with more
> irregular inflection in Norwegian too. :) )
>
> As indicated by the numeric code, there are actually two root words «hoppe».
> One (37933) is a noun, meaning mare (female horse), and the other (37934) is a
> verb, meaning «jump». The adjective (code «adj») is derived from the
> verb, and therefore has the same code as it. «fem» is the gender, «eint» means
> singular, and «ub» and «bu» mean indefinite and definite form, respectively.
>
> Is this information of any use when generating the dictionary file, and how can
> I use it? From what I've read about Hunspell, the main part of the affix file is
> only used as a way to compress the dictionary, and doesn't have any effect on
> which words are suggested by Hunspell.
>
> If so, could (should) I just use affixcompress on the words in the third column
> to generate a dictionary file? It seems to work well.
> Or is there a way to use
> the information on each word to *automatically* improve the suggestions (I
> will of course also add suggestion hints in the affix file manually), or reduce
> the dictionary size, or improve the speed for lookups and suggestions?

Automatic compression is perfect for a spelling dictionary, but the upcoming thesaurus extension needs real data for stemming, and needs extra information for morphological generation. The automatic dictionary compression has a drawback for stemming: possible artificial morphology:

$ hunspell -d en
windows
+ wind

(This is not too good for the dictionary-based suggestion either.) Future versions of affixcompress will be able to use word-frequency data to correct the stem analysis.

Your dictionary development needs a new script to keep the real stems (you can add irregular forms to the dic file: "mice st:mouse", see http://www.openoffice.org/issues/show_bug.cgi?id=19563) and to encode the morphological information in the dictionary. When you need this development for the Norwegian thesaurus, I will help you.

Thanks for your questions.

Regards,
László

> Thanks in advance for your reply.
>
> --
> Regards,
> Karl Ove Hufthammer

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
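As a sketch of the "mice st:mouse" idea from the message above (the /S flag and the entries other than mice are invented for illustration), a dic fragment with st: morphological fields could look like this:

```
# fragment of a .dic file: the st: field attaches the real stem
# to an irregular form, so stemming does not invent an artificial one
3
mouse/S
mice st:mouse
house/S
```

With such an entry, Hunspell's stemming output (hunspell -s) should report mouse as the stem of mice rather than an artificial analysis; see the issue link above for the details of the st: field.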
