Helmut Jarausch:
> I'd ask in comp.compression where the specialists are listening and who are
> very helpful.
Asking in comp.compression is a good starting point.

My suggestions (sorry if they look a bit unsorted): it depends on what language you want to use, how much you want to compress the strings, whether they are ASCII or Unicode, how quickly you want to compress/decompress them, and how much time you have to write the encoder/decoder. You can use Python or a C/D extension; if you use C you will need more time to develop it, but you will be much freer to use any algorithm you want (well, you are free with Python too, but the code may be slower if it performs low-level things), so D may be an acceptable compromise.

Generally it's better to start with the simplest solution, and go on from there looking for more complex ones only if the first one isn't enough. In Python the simplest thing to try is to compress the strings with the "zip" codec:

"George Washington".encode("zip")

Probably the result is longer than the original (25 bytes of output for this string). With "bz2" the results are probably worse (56 bytes for this string).

A possible alternative is to compress many (10-30) strings at a time, using an external table of their starting points (you can put the table at the beginning), or you can separate those strings with a separator char that's never present in the strings, and then zip that group. If you have 1 million strings you can divide them into such little groups, but the indexing in the DB becomes complex, and I don't like this solution much (a little sketch of this grouping idea is given below).

A slightly less simple Python solution is to keep a fixed corpus of text and compress it with and without the little string you want to add (like you say), compressing/decompressing everything each time (if the corpus is small enough this isn't that slow), for example:

>>> s2 = a 2488 bytes long text
>>> len(s2)
2488
>>> z1 = (s2).encode("zip")
>>> len(z1)
1321
>>> s = 'George Washington'
>>> len((s).encode("zip"))
25
>>> len(s.encode("bz2"))
56
>>> z2 = (s2+s).encode("zip")
>>> len(z2)
1332
>>> len(z2) - len(z1)
11

So you need to store only this 11-byte-long string to be able to decompress it (a sketch that wraps this trick into two little functions is also given below). A bigger corpus (s2) allows a bit more compression, but it requires more time for compression/decompression:

>>> len(s3)  # s3 is a longer English text
10811
>>> z3 = s3.encode("zip")
>>> len(z3)
4971
>>> z4 = (s3+s).encode("zip")
>>> len(z4)
4980
>>> len(z4) - len(z3)
9

One of the simpler ways to compress short ASCII English texts is to note that the most frequent chars are very few, so you can encode the 15 most common English chars with 1 nibble each, and all the other ones with 3 nibbles (the first nibble is an escape, and the two following ones are the whole byte). If you use a low-level language you can write this in a few lines of code; it's very fast and it may shave off 25-33% of your bytes. It means that your "George Washington" (17 chars long) may become about 12-13 bytes long. I have implemented this in Python too, it's easy (a sketch is given below).

Another possible solution is to use Huffman compression with fixed symbol frequencies. In Python there are ways to write such a compressor/decompressor in just 10-15 lines of code, and you can use heapq to speed up the processing. This may lead to more compression, but I presume not much below 4-5 bpc (bits per char). But if you want to compress/decompress a LOT of strings you may need to write the code (or look for it) in C/D.
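Here is the little sketch of the grouping idea mentioned above, using the zlib module directly (the "\x00" separator is my choice here, any byte that never appears inside your strings works):

import zlib

SEP = "\x00"  # assumed never to appear inside the strings

def compress_group(strings):
    # Compress a little group (10-30) of strings as a single blob.
    return zlib.compress(SEP.join(strings))

def extract(blob, i):
    # Decompress the whole group and pull out the i-th string.
    return zlib.decompress(blob).split(SEP)[i]

For example extract(compress_group(["George Washington", "John Adams"]), 1) gives back "John Adams". The indexing problem stays: you still need to know which blob holds which string.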
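And this is a sketch that wraps the fixed-corpus trick into two functions (just an illustration of the idea: "corpus.txt" is a hypothetical shared English text, and note that besides the tail you also have to store the little prefix length n somewhere):

import zlib

CORPUS = open("corpus.txt").read()  # hypothetical shared English text
Z1 = zlib.compress(CORPUS)          # compressed corpus, computed once

def corpus_compress(s):
    z2 = zlib.compress(CORPUS + s)
    # Find how many leading bytes z2 shares with the corpus-only Z1.
    n = 0
    while n < len(Z1) and n < len(z2) and Z1[n] == z2[n]:
        n += 1
    # Store only the prefix length and the differing tail.
    return n, z2[n:]

def corpus_decompress(n, tail):
    # Rebuild the full compressed string, decompress, strip the corpus.
    return zlib.decompress(Z1[:n] + tail)[len(CORPUS):]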
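A sketch of that nibble encoder/decoder in Python (the frequency order in COMMON below is just a guess of mine, a real table should be computed from your own data):

# The 15 most frequent chars get a single nibble (codes 0-14); nibble 15
# is an escape, followed by the two nibbles of the plain byte.
COMMON = " etaonisrhldcum"  # hypothetical frequency order, 15 chars

def nibble_encode(s):
    nibs = []
    for c in s:
        i = COMMON.find(c)
        if i >= 0:
            nibs.append(i)
        else:
            b = ord(c)
            nibs.extend([15, b >> 4, b & 15])
    if len(nibs) % 2:
        nibs.append(15)  # a lone trailing escape acts as padding
    return "".join(chr((nibs[j] << 4) | nibs[j+1])
                   for j in xrange(0, len(nibs), 2))

def nibble_decode(data):
    nibs = []
    for c in data:
        nibs.extend([ord(c) >> 4, ord(c) & 15])
    out = []
    i = 0
    while i < len(nibs):
        if nibs[i] < 15:
            out.append(COMMON[nibs[i]])
            i += 1
        elif i + 3 <= len(nibs):
            out.append(chr((nibs[i+1] << 4) | nibs[i+2]))
            i += 3
        else:
            break  # trailing padding escape
    return "".join(out)

With this table len(nibble_encode("George Washington")) gives 13 bytes, down from 17 (the two capitals and the two 'g' chars each cost 3 nibbles).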
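A sketch of how heapq may be used to build the Huffman codes (the tie-breaker counter is there only so the heap never has to compare the tree nodes themselves):

import heapq

def huffman_codes(freqs):
    # Each heap item is [frequency, tie_breaker, tree]; a leaf tree is
    # the symbol itself, an internal node is a [left, right] list.
    heap = [[f, i, sym] for i, (sym, f) in enumerate(freqs)]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, [f1 + f2, count, [t1, t2]])
        count += 1
    codes = {}
    def walk(tree, out=""):
        if isinstance(tree, list):
            walk(tree[0], out + "0")
            walk(tree[1], out + "1")
        else:
            codes[tree] = out
    walk(heap[0][2])
    return codes

freqs = [('a', 45), ('b', 13), ('c', 12), ('d', 16), ('e', 9), ('f', 5)]
print huffman_codes(freqs)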
This is one implementation that doesn't use heapq:

from operator import itemgetter

def clean_tree(obj):
    # Drop the frequencies, leaving a nested [left, right] tree of symbols.
    if isinstance(obj, tuple):
        return obj[0]
    else:
        return [clean_tree(obj[0][0]), clean_tree(obj[0][1])]

def huffman_generate(freqsl):
    # Repeatedly merge the two least frequent nodes until two are left.
    while len(freqsl) > 2:
        freqsl.sort(key=itemgetter(1))
        freqsl = [[freqsl[0:2], freqsl[0][1] + freqsl[1][1]]] + freqsl[2:]
    return [freqsl, -1]

def tree2dict(tree):
    # Walk the tree, assigning "0" to left branches and "1" to right ones.
    def encode(tree, out=""):
        if isinstance(tree, list):
            encode(tree[0], out + "0")
            encode(tree[1], out + "1")
        else:
            dic[tree[0]] = out
        return dic
    dic = {}
    return encode(tree)

freqs = [('a', 45), ('b', 13), ('c', 12), ('d', 16), ('e', 9), ('f', 5)]
print tree2dict(clean_tree(huffman_generate(freqs)))

If you want to use Huffman coding you can use the usual tricks to increase compression: use 2-char symbols, or even whole words as symbols (but you can use whole words only if you are sure your language is English with few strange words). If you want more compression, like a 6-byte (2.8 bpc) output, you will need more complex algorithms. On long English texts the best compressors need less than 0.9 bpc, but they are surely useless for you. I don't think you will be able to go down to 6 bytes for that string, unless you are ready to pay a lot of computational time :-)

Bye,
bearophile
--
http://mail.python.org/mailman/listinfo/python-list