On Thu, 30 Jun 2005, Agustin Martin wrote:

> On Thu, Jun 30, 2005 at 05:39:29AM -0600, Kevin Atkinson wrote:
> > 
> > A few things to take into consideration.
> > 
> > 1) To minimize the space used, the word list should be compressed with
> > "prezip -s".  (The "-s" re-sorts the word list using the "C" locale,
> > which is needed for maximum compression with prezip.)  It can then be
> > further compressed with bzip2.  You can decompress it by piping it
> > through "bzcat | precat".  To give you an idea of the sizes produced by
> > the various methods, here are the file sizes for en-common.wl (.cwl is
> > the word list compressed with prezip):
> > 
> > 1224 en-common.wl
> >  424 en-common.cwl
> >  136 en-common.cwl.bz2
> >  164 en-common.cwl.gz
> >  432 en-common.wl.bz2
> >  332 en-common.wl.gz
> > 
> > Yes, bzip2 is WORSE than gzip on a plain sorted word list (compare
> > en-common.wl.bz2 with en-common.wl.gz).
> > 
> > Also, prezip and friends consist of an ANSI C program and some shell 
> > scripts which can easily be separated out into a separate package so that 
> > you can also use them with Ispell if so desired.
> 
> Thanks for the hints, Kevin
> 
> I assume this applies only to non affix compressed wordlists. 

No, it also helps with affix-compressed word lists.
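
As a rough illustration of why sorting matters for this kind of
compression, here is a minimal sketch of front coding, where each word is
stored as the number of leading characters it shares with the previous
word plus the remaining suffix.  This is an assumption about the general
idea only, not a description of prezip's actual file format, and the
function name front_code is hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration of front coding (shared-prefix encoding).
# NOT prezip's real format; it only shows why a sorted list compresses well.
front_code() {
    # Sort in the "C" locale so neighbouring words share long prefixes.
    LC_ALL=C sort | awk '
    {
        # Count the leading characters shared with the previous word.
        n = 0
        while (n < length(prev) && n < length($0) &&
               substr(prev, n + 1, 1) == substr($0, n + 1, 1))
            n++
        # Emit the shared-prefix length and the remaining suffix.
        print n " " substr($0, n + 1)
        prev = $0
    }'
}

printf '%s\n' zebra apple applet apply | front_code
# Prints:
# 0 apple
# 5 t
# 4 y
# 0 zebra
```

In a sorted list most entries shrink to a small number and a short
suffix, whether the words are plain or carry affix flags.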

> I think we
> should also encourage affix compression when possible, hash sizes are much
> better.

If the language in question does not have any sort of soundslike data,
then affix compression is a clear win.  However, if the language uses
phonetic soundslike data (such as German, French, and most importantly
English), then the choice to enable affix compression in the compiled
hash table is much more difficult.  Aspell will support both affix
compression and phonetic soundslike lookup together, but the results may
not be as good.  Thus I, in general, don't recommend it.  You should use
the settings in the official dictionary package.

> Regarding bzip2, it implies adding another dependency. While most systems
> already have it installed I personally prefer using gzip, which must always
> be present, even if that implies a larger size.

Well, bzip2 is very common now, and in fact my dictionary packages use
bzip2 and no one has complained yet.  This sounds like a policy decision
that should possibly be discussed with other Debian developers (or whoever
the appropriate people are for policy decisions such as that).

One point against bzip2 is that it is slower; however, this may not make 
much of a difference, as most of the time will be spent compiling the 
word list:

$ time gunzip -c < en-common.cwl.gz | precat | aspell --lang=en create \
    master ./en-common.rws

real    0m3.160s
user    0m3.020s
sys     0m0.130s

$ time bunzip2 -c < en-common.cwl.bz2 | precat | aspell --lang=en create \
    master ./en-common.rws

real    0m3.438s
user    0m3.280s
sys     0m0.130s

I won't say anything else on this point, as it is not something I feel 
very strongly about.

> All the last tests I did were using plain gzip and affix compressed
> wordlists. Only the very first tests were done with gzipped raw (no prezip)
> wordlists. Since the system seems viable, I will add support for
> prezip/precat if wordlist name is of the form .cwl.gz. 
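
A suffix-based dispatch along those lines might look like the sketch
below.  The function name wordlist_reader is hypothetical, and it only
prints the pipeline it would choose, so the dispatch can be checked
without prezip installed.  It uses nothing beyond gunzip -c, bunzip2 -c
and precat, as in the commands above, and it assumes file names without
spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the decompression pipeline for a word list,
# chosen by file name suffix.  The .cwl.* patterns must come before the
# plain .gz/.bz2 patterns, otherwise they would never match.
wordlist_reader() {
    case "$1" in
        *.cwl.gz)  printf 'gunzip -c < %s | precat\n' "$1" ;;
        *.cwl.bz2) printf 'bunzip2 -c < %s | precat\n' "$1" ;;
        *.cwl)     printf 'precat < %s\n' "$1" ;;
        *.gz)      printf 'gunzip -c < %s\n' "$1" ;;
        *.bz2)     printf 'bunzip2 -c < %s\n' "$1" ;;
        *)         printf 'cat < %s\n' "$1" ;;
    esac
}

wordlist_reader en-common.cwl.gz
# Prints: gunzip -c < en-common.cwl.gz | precat
```

The output of the chosen pipeline would then be fed to
"aspell --lang=en create master" as in the timings above.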

-- 
http://kevin.atkinson.dhs.org


