On Thu, Jun 30, 2005 at 05:39:29AM -0600, Kevin Atkinson wrote:
> 
> A few things to take into consideration.
> 
> 1) To minimize the space used, the word list should be compressed with
> "prezip -s" (the "-s" re-sorts the word list using the "C" locale,
> which is needed for maximum compression with prezip) and then further
> compressed with bzip2.  You can decompress it by piping it through "bzcat
> | precat".  To give you an idea of the sizes using various methods, here
> are the file sizes for en-common.wl (.cwl is the word list compressed with
> prezip):
> 
> 1224 en-common.wl
>  424 en-common.cwl
>  136 en-common.cwl.bz2
>  164 en-common.cwl.gz
>  432 en-common.wl.bz2
>  332 en-common.wl.gz
> 
> Yes, bzip2 is WORSE than gzip on the plain sorted word list (.wl).
> 
> Also, prezip and friends consist of an ANSI C program and some shell 
> scripts which can easily be separated out into a separate package so that 
> you can also use them with Ispell if so desired.

Thanks for the hints, Kevin
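
As a rough illustration of why a prefix-compression pass helps so much on a
sorted list, here is a minimal front-coding sketch in Python. It is NOT
prezip's actual on-disk format; the function names are hypothetical and it
only shows the general idea of storing each word as "how much it shares with
the previous word, plus the rest". Python's sorted() orders plain strings by
code point, which for ASCII words matches the "C" locale ordering mentioned
above.

```python
def front_encode(words):
    """Encode a sorted word list as (shared_prefix_length, suffix) pairs.

    Illustrative only: this is generic front coding, not prezip's format.
    """
    encoded = []
    previous = ""
    for word in words:
        # Count the leading characters shared with the previous word.
        shared = 0
        limit = min(len(word), len(previous))
        while shared < limit and word[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded


def front_decode(encoded):
    """Rebuild the original word list from (shared_prefix_length, suffix) pairs."""
    words = []
    previous = ""
    for shared, suffix in encoded:
        word = previous[:shared] + suffix
        words.append(word)
        previous = word
    return words


# Sorting first matters: neighbours then share long prefixes.
words = sorted(["compress", "compressed", "compression", "computer", "zip"])
pairs = front_encode(words)
```

On this tiny sample most words shrink to a short suffix, which is the
redundancy a general-purpose compressor then no longer has to find itself.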

I assume this applies only to non-affix-compressed wordlists. I think we
should also encourage affix compression when possible, since the resulting
hash sizes are much better.

Regarding bzip2, it implies adding another dependency. While most systems
already have it installed, I personally prefer using gzip, which must always
be present, even if that implies a larger size.

All my latest tests used plain gzip and affix-compressed wordlists. Only
the very first tests were done with gzipped raw (no prezip) wordlists.
Since the system seems viable, I will add support for prezip/precat if the
wordlist name is of the form *.cwl.gz.
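
A minimal sketch of that name-based selection, in Python. The function name
and the returned structure are hypothetical (this is not the code in the
actual scripts); it only shows the idea of choosing the decompression stages
from the file name. "precat" is the tool mentioned above, and "gzip -dc" /
"bzip2 -dc" are the usual ways to decompress to standard output.

```python
def decompress_stages(filename):
    """Return the commands to pipe a word list through, chosen by suffix.

    Hypothetical helper: each stage is an argv list to be chained in a pipe.
    """
    stages = []
    name = filename
    # Outer layer: general-purpose compression, if any.
    if name.endswith(".gz"):
        stages.append(["gzip", "-dc"])
        name = name[: -len(".gz")]
    elif name.endswith(".bz2"):
        stages.append(["bzip2", "-dc"])
        name = name[: -len(".bz2")]
    # Inner layer: prezip-compressed word lists need precat afterwards.
    if name.endswith(".cwl"):
        stages.append(["precat"])
    return stages
```

A plain .gz word list gets only the gzip stage, so existing packages would
keep working unchanged.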

> 
> 2) To avoid spitting out a bunch of warnings during compile you should
> clean it by piping it through "aspell clean strict".  This will remove all
> problem words and affix flags that Aspell will complain about when
> compiling.  The compiled dictionary should be the same with either the
> dirty or the clean word list.  You can also use "aspell clean" but that
> handles some errors in a different way and the resulting compiled
> dictionary may be different.
> 
> 3) Aspell by default performs a number of checks when creating a 
> word list, some of these can be expensive.  You can disable the expensive 
> one with "--dont-validate-affixes".  If you clean the word list first this 
> should be 100% safe.  It should also be safe to use on a dirty word list 
> as the invalid affix flags don't cause a problem in the compiled word 
> list.  You may also consider using "--validate-words" but those checks are 
> not very expensive.

This is very interesting info. I was worried about the bunch of bug
reports that would follow all those warnings about non-applicable
affix flags.
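
Putting points 2) and 3) together, the build step could look roughly like
the pipeline assembled below. This is a sketch under assumptions: the helper
name is hypothetical, "aspell clean strict" and "--dont-validate-affixes" are
taken from the text above, and the "aspell --lang=... create master ..."
form is my assumption of the usual way to compile a word list, to be checked
against the Aspell manual before use.

```python
import shlex


def build_pipeline(wordlist, lang, output, clean=True):
    """Return a shell pipeline string that cleans and compiles a word list.

    Hypothetical helper; the 'create master' invocation is an assumption.
    """
    stages = ["cat " + shlex.quote(wordlist)]
    if clean:
        # Drop problem words and affix flags so the compile stays quiet.
        stages.append("aspell clean strict")
    # Skip the expensive affix validation during compilation.
    stages.append(
        "aspell --lang=" + shlex.quote(lang)
        + " --dont-validate-affixes create master " + shlex.quote(output)
    )
    return " | ".join(stages)


command = build_pipeline("en.wl", "en", "./en.rws")
```

The function only builds the command string; nothing is executed, so it can
be inspected before being handed to a shell.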

Again, thanks for your feedback

Cheers,

-- 
Agustin

