On Thu, 30 Jun 2005, Agustin Martin wrote:

> On Thu, Jun 30, 2005 at 05:39:29AM -0600, Kevin Atkinson wrote:
> >
> > A few things to take into consideration.
> >
> > 1) To minimize the space used, the word list should be compressed with
> > "prezip -s". (The "-s" is to re-sort the word list using the "C" locale,
> > which is needed for maximum compression with prezip.) It should then be
> > further compressed with bzip2. You can decompress it by piping it through
> > "bzcat | precat". To give you an idea of the sizes using various methods,
> > here are the file sizes for en-common.wl (.cwl is the word list
> > compressed with prezip):
> >
> >   1224 en-common.wl
> >    424 en-common.cwl
> >    136 en-common.cwl.bz2
> >    164 en-common.cwl.gz
> >    432 en-common.wl.bz2
> >    332 en-common.wl.gz
> >
> > Yes, bzip2 is WORSE than gzip on a sorted word list.
> >
> > Also, prezip and friends consist of an ANSI C program and some shell
> > scripts which can easily be separated out into a separate package so that
> > you can also use them with Ispell if so desired.
>
> Thanks for the hints, Kevin
>
> I assume this applies only to non affix compressed wordlists.

No, it also helps with affix compressed word lists.

> I think we
> should also encourage affix compression when possible, hash sizes are much
> better.

If the language in question does not have any sort of soundslike data,
then affix compression is a clear win; however, if the language uses
phonetic soundslike data (such as German, French, and most importantly
English), then the choice to enable affix compression in the compiled hash
table is much more difficult. Aspell will support both affix compression
and phonetic soundslike lookup, but the results may not be as good. Thus I,
in general, don't recommend it. You should use the settings in the official
dictionary package.

> Regarding bzip2, it implies adding another dependency. While most systems
> already have it installed I personally prefer using gzip, which must always
> be present, even if that implies a larger size.

Well, bzip2 is very common now, and in fact my dictionary packages use bzip2
and no one has complained yet. This sounds like a policy decision that
should possibly be discussed with other Debian developers (or whoever the
appropriate people are for policy decisions such as that).

One point against bzip2 is that it is slower; however, this may not make
much of a difference, as most of the time will be spent in compiling the
word list:

  $ time gunzip -c < en-common.cwl.gz | precat | aspell --lang=en create master ./en-common.rws
  real 0m3.160s
  user 0m3.020s
  sys  0m0.130s

  $ time bunzip2 -c < en-common.cwl.bz2 | precat | aspell --lang=en create master ./en-common.rws
  real 0m3.438s
  user 0m3.280s
  sys  0m0.130s

I won't say anything else on this point, as it is not something I feel very
strongly about.

> All the last tests I did were using plain gzip and affix compressed
> wordlists. Only the very first tests were done with gzipped raw (no prezip)
> wordlists. Since the system seems viable, I will add support for
> prezip/precat if wordlist name is of the form .cwl.gz.
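
For what it is worth, choosing the decompression pipeline from the word list
name, as described in the quoted paragraph, could look roughly like the
sketch below. This is only an illustration under stated assumptions: the
function name wordlist_decompress_cmd is made up for this example, the
suffix conventions (.cwl for prezip output, optionally followed by .gz or
.bz2) are the ones used in this thread, and treating anything else as plain
text is a guess at a sensible fallback, not how any existing package
behaves.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): print the command
# pipeline that turns a word list on stdin back into plain text, chosen
# only from the suffix of the file name passed as $1.
wordlist_decompress_cmd() {
    case "$1" in
        # prezip output further compressed with bzip2 or gzip
        *.cwl.bz2) echo "bunzip2 -c | precat" ;;
        *.cwl.gz)  echo "gunzip -c | precat" ;;
        # prezip output with no outer compression
        *.cwl)     echo "precat" ;;
        # raw word list with only outer compression
        *.bz2)     echo "bunzip2 -c" ;;
        *.gz)      echo "gunzip -c" ;;
        # assumption: anything else is already plain text
        *)         echo "cat" ;;
    esac
}
```

A caller could then run something like
eval "$(wordlist_decompress_cmd "$f")" < "$f" | aspell --lang=en create master ./out.rws
where $f and ./out.rws are placeholders. The more specific *.cwl.* patterns
are listed first so that a prezipped list is never mistaken for a raw one.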

-- 
http://kevin.atkinson.dhs.org