I have given a lot of thought to compressing Aspell's compiled word list, and here is what I have come up with.

1. Affix Compression)

It would still be possible to use affix compression with the soundslike lookup; however, the compiled word list would still take up more space than Ispell's. The data would be laid out roughly as follows:

  <base word><affix flags><base word><affix flags>...
  <word hash>       (size = number of base words * 1.25 * 4 bytes)
  <soundslike base><affix flag, 1-2 bytes><word list><affix flag><word list>
  <soundslike hash> (size = number of base soundslikes * 1.25 * 4 bytes)

where each word list consists of:

  <num of words, 2 bytes> <base word offset, 3 bytes><affix flag, 1 byte> ...

(A sketch of a decoder for this record format is appended at the end of this message.)

The approximate total size would be between

  (size of Ispell hash table) + (num of expanded words) * 4

and

  (size of Ispell hash table) * 2 + (num of expanded words) * 4

depending on the roughness of the soundslike data. A little more space can be saved by using 3-byte words instead of 4-byte integers in the hash tables.

2. Huffman)

The other idea I had was to count the frequencies with which all single letters appear, all strings of length 2 appear, all strings of length 3 appear, and so on, and then encode the words with a Huffman code. There are some good notes on Huffman coding at http://www.compressconsult.com/huffman/. If you don't know what Huffman coding is, there is a good introduction at http://www.faqs.org/faqs/compression-faq/part2/section-1.html.

If I can compress most words to under 4 bytes, then I can store the words directly in the hash table, so the space would become:

    # of words       * (1.25 * 4 + 4 = 9 bytes)
  + # of soundslikes * (1.25 * 4     = 5 bytes)
  + size of the prefix-code lookup tables

The tricky parts would be:

  1. Identifying when a string is common enough to justify its own prefix code.
  2. Determining where to split a word into coded strings. For example, if codes
     are available for the following strings:

       freq 1, equency 2, ency 3, ue 4

     then the word "frequency" can be encoded in several different ways:

       14ncy   1u3   fr2

Which one is optimal depends on the length of the prefix code for each string and for the individual letters. This problem should be solvable efficiently with dynamic programming, as it seems closely related to the matrix chain multiplication problem; sketches of both the frequency counting and the splitting are appended at the end of this message. However, the choice of code words will in turn affect the frequencies of the strings, and thus the optimal prefix codes....

Additional Notes)

Both of these methods will create problems with the new code I added in the latest snapshot to control compound formation. The affix compression method should still work, provided that whenever a base word can be used in a compound, so can all of its expanded forms. For the Huffman method, the compound information would either need to be stored separately or merged in with the word. If it is stored separately, another (# of words) bytes will be needed; the existing method only takes up extra space when a word actually has compound information attached to it.

A. Implementation)

I don't plan on implementing either of these methods any time real soon now, as I have lots of other things I want to get done first in Aspell. However, if someone plans on implementing either of these methods, I would gladly work with you.

B. Feedback)

As always, feedback is more than welcome. I am especially interested in feedback on the Huffman method.

---
Kevin Atkinson
[EMAIL PROTECTED]
http://metalab.unc.edu/kevina/
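
Appended sketch 1: a minimal decoder for the word-list record format from section 1. It assumes the byte layout sketched there (a 2-byte count, then one 3-byte base-word offset plus 1-byte affix flag per word, little-endian); the names WordRef and read_word_list are hypothetical, not anything in the Aspell source.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One entry of a soundslike's word list: <base word offset, 3 bytes>
    // <affix flag, 1 byte>, as sketched in section 1.
    struct WordRef {
        uint32_t base_offset; // offset of the base word in the word block
        uint8_t  affix_flag;  // which set of affix flags applies
    };

    // Decode a word list starting at p: <num of words, 2 bytes> followed
    // by that many 4-byte records. Little-endian byte order is assumed.
    std::vector<WordRef> read_word_list(const uint8_t* p)
    {
        uint16_t count = uint16_t(p[0]) | (uint16_t(p[1]) << 8);
        p += 2;
        std::vector<WordRef> refs(count);
        for (WordRef& r : refs) {
            r.base_offset = uint32_t(p[0])
                          | (uint32_t(p[1]) << 8)
                          | (uint32_t(p[2]) << 16);
            r.affix_flag = p[3];
            p += 4;
        }
        return refs;
    }

    int main()
    {
        // A word list with two entries, built by hand for illustration.
        const uint8_t buf[] = { 2, 0,                  // num of words = 2
                                0x10, 0x00, 0x00, 3,   // offset 0x10,  flag 3
                                0x2a, 0x01, 0x00, 7 }; // offset 0x12a, flag 7
        for (const WordRef& r : read_word_list(buf))
            std::printf("offset=%u flag=%u\n", r.base_offset, r.affix_flag);
    }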
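
Appended sketch 2: for step 1 of section 2, the raw counts are straightforward to gather; the hard part, deciding which of these strings deserve their own codes, is the open problem noted there. A minimal sketch, assuming a plain in-memory word list; count_substrings is a made-up name.

    #include <map>
    #include <string>
    #include <vector>

    // Count how often every substring of length 1..max_len occurs across
    // the word list. The counts for length-1 strings give the letter
    // frequencies; longer strings are candidates for their own codes.
    std::map<std::string, long> count_substrings(
        const std::vector<std::string>& words, int max_len)
    {
        std::map<std::string, long> freq;
        for (const std::string& w : words)
            for (int len = 1; len <= max_len; ++len)
                for (int i = 0; i + len <= int(w.size()); ++i)
                    ++freq[w.substr(i, len)];
        return freq;
    }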
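
Appended sketch 3: for step 2 of section 2, the optimal split of a single word is indeed a small dynamic program: the cheapest encoding of the first j characters is the cheapest encoding of some shorter prefix plus one coded string. A minimal sketch, assuming the prefix-code lengths are already known; min_bits and the example bit lengths are made up for illustration. The catch noted in section 2 remains: the chosen splits change the string frequencies, so in practice one would iterate between recounting and recoding.

    #include <climits>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Cheapest encoding of `word`, given a table mapping each codable
    // string (all single letters plus the common longer strings) to the
    // length of its prefix code in bits. best[j] is the minimum number
    // of bits needed to encode word[0..j).
    int min_bits(const std::string& word, const std::map<std::string, int>& bits)
    {
        int n = int(word.size());
        std::vector<int> best(n + 1, INT_MAX);
        best[0] = 0;
        for (int i = 0; i < n; ++i) {
            if (best[i] == INT_MAX) continue;  // prefix not encodable
            for (int j = i + 1; j <= n; ++j) { // extend by word[i..j)
                auto it = bits.find(word.substr(i, j - i));
                if (it != bits.end() && best[i] + it->second < best[j])
                    best[j] = best[i] + it->second;
            }
        }
        return best[n];
    }

    int main()
    {
        // Made-up code lengths: 4 bits per single letter, shorter codes
        // for the strings from the "frequency" example in section 2.
        std::map<std::string, int> bits;
        for (char c = 'a'; c <= 'z'; ++c)
            bits[std::string(1, c)] = 4;
        bits["freq"] = 6; bits["equency"] = 7; bits["ency"] = 6; bits["ue"] = 5;

        // fr + equency (4 + 4 + 7 = 15 bits) beats freq + u + ency (16)
        // and freq + ue + ncy (23) under these lengths.
        std::printf("%d bits\n", min_bits("frequency", bits));
    }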
