I have given a lot of thought to compressing Aspell's compiled word list, and here is what I have come up with.

1. Affix Compression)

It would still be possible to use affix compression with the soundslike lookup; however, the compiled word list would still take up more space than Ispell's. The data would be laid out roughly as follows:

  <base word><affix flags><base word><affix flags>...
  <word hash>       (size = number of base words * 1.25 * 4 bytes)
  <soundslike base><affix flag, 1-2 bytes><word list><affix flag><word list>
  <soundslike hash> (size = number of base soundslikes * 1.25 * 4 bytes)

where each word list consists of:

  <num of words, 2 bytes> <base word offset, 3 bytes><affix flag, 1 byte> ...

(A sketch of a decoder for this record format is appended at the end of this message.)

The approximate total size would be between

  (size of Ispell hash table) + (num of expanded words) * 4

and

  (size of Ispell hash table) * 2 + (num of expanded words) * 4

depending on the roughness of the soundslike data. A little more space can be saved by using 3-byte words instead of 4-byte integers in the hash tables.

2. Huffman)

The other idea I had was to count the frequencies with which all single letters appear, all strings of length 2 appear, all strings of length 3 appear, and so on, and then encode the words with a Huffman code. There are some good notes on Huffman coding at http://www.compressconsult.com/huffman/. If you don't know what Huffman coding is, there is a good introduction at http://www.faqs.org/faqs/compression-faq/part2/section-1.html.

If I can compress most words to under 4 bytes, then I can store the words directly in the hash table, so the space would become:

    # of words       * (1.25 * 4 + 4 = 9 bytes)
  + # of soundslikes * (1.25 * 4     = 5 bytes)
  + size of the prefix-code lookup tables

The tricky parts would be:

  1. Identifying when a string is common enough to justify its own prefix code.
  2. Determining where to split a word into coded strings. For example, if codes
     are available for the following strings:

       freq 1, equency 2, ency 3, ue 4

     then the word "frequency" can be encoded in several different ways:

       14ncy   1u3   fr2

Which one is optimal depends on the length of the prefix code for each string and for the individual letters. This problem should be solvable efficiently with dynamic programming, as it seems closely related to the matrix chain multiplication problem; sketches of both the frequency counting and the splitting are appended at the end of this message. However, the choice of code words will in turn affect the frequencies of the strings, and thus the optimal prefix codes....

Additional Notes)

Both of these methods will create problems with the new code I added in the latest snapshot to control compound formation. The affix compression method should still work, provided that whenever a base word can be used in a compound, so can all of its expanded forms. For the Huffman method, the compound information would either need to be stored separately or merged in with the word. If it is stored separately, another (# of words) bytes will be needed; the existing method only takes up extra space when a word actually has compound information attached to it.

A. Implementation)

I don't plan on implementing either of these methods any time real soon now, as I have lots of other things I want to get done first in Aspell. However, if someone plans on implementing either of these methods, I would gladly work with you.

B. Feedback)

As always, feedback is more than welcome. I am especially interested in feedback on the Huffman method.

---
Kevin Atkinson
[EMAIL PROTECTED]
http://metalab.unc.edu/kevina/
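
Appended sketch 1: a minimal decoder for the word-list record format from section 1. It assumes the byte layout sketched there (a 2-byte count, then one 3-byte base-word offset plus 1-byte affix flag per word, little-endian); the names WordRef and read_word_list are hypothetical, not anything in the Aspell source.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One entry of a soundslike's word list: <base word offset, 3 bytes>
    // <affix flag, 1 byte>, as sketched in section 1.
    struct WordRef {
        uint32_t base_offset; // offset of the base word in the word block
        uint8_t  affix_flag;  // which set of affix flags applies
    };

    // Decode a word list starting at p: <num of words, 2 bytes> followed
    // by that many 4-byte records. Little-endian byte order is assumed.
    std::vector<WordRef> read_word_list(const uint8_t* p)
    {
        uint16_t count = uint16_t(p[0]) | (uint16_t(p[1]) << 8);
        p += 2;
        std::vector<WordRef> refs(count);
        for (WordRef& r : refs) {
            r.base_offset = uint32_t(p[0])
                          | (uint32_t(p[1]) << 8)
                          | (uint32_t(p[2]) << 16);
            r.affix_flag = p[3];
            p += 4;
        }
        return refs;
    }

    int main()
    {
        // A word list with two entries, built by hand for illustration.
        const uint8_t buf[] = { 2, 0,                  // num of words = 2
                                0x10, 0x00, 0x00, 3,   // offset 0x10,  flag 3
                                0x2a, 0x01, 0x00, 7 }; // offset 0x12a, flag 7
        for (const WordRef& r : read_word_list(buf))
            std::printf("offset=%u flag=%u\n", r.base_offset, r.affix_flag);
    }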
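
Appended sketch 2: for step 1 of section 2, the raw counts are straightforward to gather; the hard part, deciding which of these strings deserve their own codes, is the open problem noted there. A minimal sketch, assuming a plain in-memory word list; count_substrings is a made-up name.

    #include <map>
    #include <string>
    #include <vector>

    // Count how often every substring of length 1..max_len occurs across
    // the word list. The counts for length-1 strings give the letter
    // frequencies; longer strings are candidates for their own codes.
    std::map<std::string, long> count_substrings(
        const std::vector<std::string>& words, int max_len)
    {
        std::map<std::string, long> freq;
        for (const std::string& w : words)
            for (int len = 1; len <= max_len; ++len)
                for (int i = 0; i + len <= int(w.size()); ++i)
                    ++freq[w.substr(i, len)];
        return freq;
    }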
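
Appended sketch 3: for step 2 of section 2, the optimal split of a single word is indeed a small dynamic program: the cheapest encoding of the first j characters is the cheapest encoding of some shorter prefix plus one coded string. A minimal sketch, assuming the prefix-code lengths are already known; min_bits and the example bit lengths are made up for illustration. The catch noted in section 2 remains: the chosen splits change the string frequencies, so in practice one would iterate between recounting and recoding.

    #include <climits>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Cheapest encoding of `word`, given a table mapping each codable
    // string (all single letters plus the common longer strings) to the
    // length of its prefix code in bits. best[j] is the minimum number
    // of bits needed to encode word[0..j).
    int min_bits(const std::string& word, const std::map<std::string, int>& bits)
    {
        int n = int(word.size());
        std::vector<int> best(n + 1, INT_MAX);
        best[0] = 0;
        for (int i = 0; i < n; ++i) {
            if (best[i] == INT_MAX) continue;  // prefix not encodable
            for (int j = i + 1; j <= n; ++j) { // extend by word[i..j)
                auto it = bits.find(word.substr(i, j - i));
                if (it != bits.end() && best[i] + it->second < best[j])
                    best[j] = best[i] + it->second;
            }
        }
        return best[n];
    }

    int main()
    {
        // Made-up code lengths: 4 bits per single letter, shorter codes
        // for the strings from the "frequency" example in section 2.
        std::map<std::string, int> bits;
        for (char c = 'a'; c <= 'z'; ++c)
            bits[std::string(1, c)] = 4;
        bits["freq"] = 6; bits["equency"] = 7; bits["ency"] = 6; bits["ue"] = 5;

        // fr + equency (4 + 4 + 7 = 15 bits) beats freq + u + ency (16)
        // and freq + ue + ncy (23) under these lengths.
        std::printf("%d bits\n", min_bits("frequency", bits));
    }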
