So, I've been playing with compression algorithms, by way of fun and games.

In particular, I've built a 2nd-order adaptive Huffman encoder in Python. It's not designed to be particularly efficient, but it's possible to build one of these in about 64k or so, I think - that's considerably less than the 256k that a DEFLATE compressor uses. (A zlib-based codec costs 300k total at maximum compression, so it's not equal, whereas this would be.)

The simple explanation of how it works is that it picks an encoding based on the probability so far of the symbol occuring given the last one. More probable symbols gain shorter codes, and the most probable usually ends up with a 1- or 2-bit code. Unknown symbols generate additional codes, and if they've never occured before at all, get dumped literally.

Thus "Test" compresses from 4 to 6 - actually encoded as SYM_NYT "T" SYM_NYT "e" ... SYM_FLUSH, so there's 9 symbols in there.

"TestTestTestTest" on the other hand compresses from 12 to 9 - actually those last three "Test" strings compress to 1-bit per symbol each, because the compressor has learnt the notion.

Now, there is a point to all this, shocking as it may seem. What I wanted to do was examine what happens if we change the symbol-set we input into the compression engine, so (and you'll see this at the end of the attached Python code) I wrote out a simple bit of XMPP - actually a misleadingly simple bit.

A DEFLATE compressor should do well on this, because there's lots of repeated substrings.

My Huffman encoder, working on the string itself, brings it from 314 to 165 - that's 52.55% - and DEFLATE manages to get 36.94%. So, the moral of the story here is that, on data which in principally textual in nature, like XML, you don't want to be using a low-order adaptive Huffman encoder. But we all knew that anyway - dictionary compressors are pretty spiffy on long repeated substrings, and DEFLATE uses a secondary (1st order, non-adaptive) Huffman.

But what happens if - instead of compressing the data as a string, the Huffman alphabet includes the events generated by parsing the XML?

When we compress "<presence/>" traditionally, we compress it using 11 symbols - one for each letter. This is handy, as it means it generic for any text, but as it happens, we don't have text - we have XML, and we can use that to our advantage.

So, instead, my compressor can compress it using symbols which describe the XML rather than the textual representation of it - in this case, amusingly, it'll make no difference, as it comes out as SYM_STARTTAG, "presence", SYM_ENDTAG, SYM_CLOSETAG. It makes slightly more difference if we compress "<presence></presence>", as the string-based compressor has nearly double the symbols, whereas a XML-based one has exactly the same.

The moment I do this, then I'm marginally beating DEFLATE - no mean feat in half the memory, albeit it might just be a product of the data.

I can also add in one more trick, which'll certainly increase the memory consumption, but ought to gain some serious savings - if I allow the compressor to identify certain strings - like element and attribute names - that are likely to be repeated, and assign them symbols, then in principle similar structures in XML should compress exceedingly well.

Testing shows that at least on the one fragment I've tried, there's a slim improvement - about 1.5% overall, or 3.7% relative to the XML-based compression.

The memory would increase, but on an entity with several sessions, these strings are likely to be shared between sessions, so I'm not entirely convinced I've broken the bank quite yet on memory usage, but it's certainly a more complex proposition to measure.

I'll adapt my script - and I make no apologies for how ugly the code is or the quantity of obtuse debugging data it outputs - to read in actual XMPP transcripts and see how it performs. Meanwhile, have fun.

(Oh, and no, I've not written the XML-based decompressor, yet.)

Dave.
--
Dave Cridland - mailto:d...@cridland.net - xmpp:d...@dave.cridland.net
 - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
 - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
# Input/output may be assumed to be structurally well-formed XML in UTF-8.
# On Tuesdays.
# (Summer only).

import xml.parsers.expat

SYM_NYT = 0
SYM_EOS = -1
SYM_FLUSH = -2

SYM_STARTTAG = -3
SYM_ATTRNAME = -4
SYM_ATTRVAL = -5
SYM_ENDTAG = -6
SYM_CLOSETAG = -7

SYM_DEFSYM = -8

SYM_MACRO_START = -100

class tree_node:
    def __init__(self, ch, number):
        self.sym = ch
        if self.sym is not None and not isinstance(self.sym,int):
            raise "Hell"
        if self.sym is not None and self.sym != 0:
            print "*create",`self.sym`
        self.parent = None
        self.left = None
        self.right = None
        self.weight = 0
        self.number = number
        
    def bump(self):
        self.weight += 1
            
    def set_right(self, t):
        self.right = t
        if t is not None:
            t.parent = self
        
    def set_left(self, t):
        self.left = t
        if t is not None:
            t.parent = self
        
    def swap(self, t):
        tmp = self.sym
        self.sym = t.sym
        t.sym = tmp
        
        tmp = self.left
        self.set_left(t.left)
        t.set_left(tmp)
        
        tmp = self.right
        self.set_right(t.right)
        t.set_right(tmp)
        
    def trace(self, bit):
        if not bit:
            return self.left
        return self.right

class tree:
    def __init__(self, sym):
        self.number = 0
        self.node_map = {}
        self.nodes = {}
        self.root = self.tree_node(SYM_NYT)
        self.sym = sym

    def tree_node(self, ch = None):
        t = tree_node(ch, self.number)
        self.nodes[self.number] = t
        self.number += 1
        if ch is not None:
            if ch in self.node_map:
                self.node_map[ch].sym = None
            self.node_map[ch] = t
        return t
            
    def update_nyt(self, ch):
        if ch == -2:
            print "*here"
        nyt = self.node_map[SYM_NYT]
        nsym = self.tree_node(ch)
        nnyt = self.tree_node(SYM_NYT)
        nyt.set_left(nnyt)
        nyt.set_right(nsym)
        nsym.bump()
        return nyt

    def update(self, ch):
        print "*update",`ch`,"in tree",`self.sym`
        if ch not in self.node_map:
            node = self.update_nyt(ch)
            if node is self.root:
                node.bump()
                return
        else:
            node = self.node_map[ch]
        while True:
            self.print_tree()
            print "*checking node",`node.number`
            lowest_node = None
            lowest_number = node.number
            for k,v in self.nodes.items():
                if v.weight == node.weight and v is not node.parent:
                    if k < lowest_number:
                        print "*equal weight and lower:",`v.number`
                        lowest_node = v
                        lowest_number = k
            if lowest_node is not None:
                print "*swap   ",`lowest_node.number`,`lowest_node.sym`,`node.number`,`node.sym`
                self.print_tree()
                lowest_node.swap(node)
                if lowest_node.sym is not None:
                    self.node_map[lowest_node.sym] = lowest_node
                if node.sym is not None:
                    self.node_map[node.sym] = node
                if node.parent is None:
                    self.root = node
                print "*swapped",`lowest_node.sym`,`node.sym`
                node = lowest_node
                self.print_tree()
            node.bump()
            if node is self.root:
                break
            node = node.parent
        self.print_tree()
    
    def print_tree(self, node=None, count=0):
        if node is None:
            node = self.root
        if count > 20:
            return
        print "%s%d <%d> [%s]" % (' ' * count, node.number, node.weight, `node.sym`)
        if node.left:
            self.print_tree(node.left, count+1)
            if node.left.parent is not node:
                print "ERROR!"
            self.print_tree(node.right, count+1)            
            if node.right.parent is not node:
                print "ERROR!"
            
    def find(self,ch):
        if ch not in self.node_map:
            return self.find(SYM_NYT)
        node = self.node_map[ch]
        path = []
        while node is not self.root:
            if node.parent.left is node:
                path.append(0)
            else:
                path.append(1)
            node = node.parent
        path.reverse()
        return path
    
    def is_nyt(self, ch):
        if ch in self.node_map:
            return False
        return True

class core:
    SYMS_LIKELY = 10
    SYMS_UNLIKELY = 0
    
    def __init__(self, dm=False):
        self.define_macros = dm
        self.state = []
        self.total = 0
        for sym in range(256):
            self.state.append(tree(sym))
        self.state[SYM_NYT].print_tree()
        self.state[SYM_NYT].update(SYM_EOS)
        self.state[SYM_NYT].print_tree()
        self.state[SYM_NYT].update(SYM_FLUSH)
        self.state[SYM_NYT].print_tree()
        if self.define_macros:
            self.state[SYM_NYT].update(SYM_DEFSYM)
        self.state[SYM_NYT].update(SYM_NYT)
        self.state[SYM_NYT].print_tree()
        self.macros = {}
        self.macro_next = SYM_MACRO_START
        self.last = SYM_NYT
        self.bitbuf = 0
        self.bitoff = 0
        self.out = ''
        self.xml = None
    
    def encode_sym(self, ch):
        tree = self.state[self.last]
        if tree.number == 1:
            tmp = self.last
            self.last = SYM_NYT
            self.encode_sym(ch)
            self.last = tmp
        else:
            bits = tree.find(ch)
            if bits:
                print "Encoding %d bits from tree %d" % (len(bits), tree.sym), `bits`
                self.output_bits(bits)
                cn = tree.root
                for bit in bits:
                    cn = cn.trace(bit)
                    if cn.sym is not None:
                        print "*sym",`cn.sym`
                if tree.is_nyt(ch):
                    if cn.sym != 0:
                        raise "Argh"
                elif cn.sym != ch:
                    raise "Argh2"
            else:
                print "No bits to encode from tree",`tree.sym`
                raise "No bits to encode?"
            if tree.is_nyt(ch):
                print "Encoding octet"
                self.output_octet(ch)
        self.state[self.last].update(ch)
        self.last = ch
    
    def encode_string(self, uu, last=False):
        self.encode_octets(uu)
        if last:
            self.encode_sym(SYM_EOS)
        else:
            self.encode_sym(SYM_FLUSH)
        self.output_flush()
        return self.out
    
    def encode_octets(self, uu):
        for ch in uu:
            sym = ord(ch)
            if sym == 0:
                break
            self.encode_sym(sym)
    
    def encode_name(self, name):
        if not self.define_macros:
            self.encode_octets(name)
            return
        if name not in self.macros:
            print "Defining macro for",`name`
            self.encode_sym(SYM_DEFSYM)
            self.encode_octets(name)
            m = self.macro_next
            self.macro_next -= 1
            self.macros[name] = m
            self.state[SYM_NYT].update(m)
            self.state[self.last].update(m)
        else:
            print "Using existing macro",`name`
            self.encode_sym(self.macros[name])
    
    def starttag(self, tag, attrs):
        self.encode_sym(SYM_STARTTAG)
        self.encode_name(tag)
        for k,v in attrs.items():
            self.encode_sym(SYM_ATTRNAME)
            self.encode_name(k)
            self.encode_sym(SYM_ATTRVAL)
            self.encode_octets(v)
        self.encode_sym(SYM_ENDTAG)
        
    def endtag(self, tag):
        self.encode_sym(SYM_CLOSETAG)
    
    def cdata(self, data):
        self.encode_octets(data)
    
    def encode_xml(self, xmldata, last=False):
        if self.xml is None:
            self.xml = xml.parsers.expat.ParserCreate()
            self.xml.StartElementHandler = self.starttag
            self.xml.EndElementHandler = self.endtag
            self.xml.CharacterDataHandler = self.cdata
            self.state[SYM_NYT].update(SYM_STARTTAG)
            self.state[SYM_NYT].update(SYM_ATTRNAME)
            self.state[SYM_NYT].update(SYM_ATTRVAL)
            self.state[SYM_NYT].update(SYM_ENDTAG)
            self.state[SYM_NYT].update(SYM_CLOSETAG)
            self.state[SYM_STARTTAG].update(SYM_DEFSYM)
            self.state[SYM_ATTRNAME].update(SYM_DEFSYM)
        self.xml.Parse(xmldata, 1)
        if last:
            self.encode_sym(SYM_EOS)
        else:
            self.encode_sym(SYM_FLUSH)
        self.output_flush()
        return self.out        
    
    def output_bits(self, bits):
        for b in bits:
            self.bitbuf <<= 1
            self.bitoff += 1
            self.bitbuf += b
            if self.bitoff == 8:
                self.out += chr(self.bitbuf)
                self.bitoff = 0
                self.bitbuf = 0
    
    def output_flush(self):
        while self.bitoff != 0:
            self.output_bits([0])
    
    def output_octet(self, ch):
        path = []
        while len(path) < 8:
            path.append(ch & 1)
            ch >>= 1
        path.reverse()
        self.output_bits(path)
        
    def decode(self, ss):
        last = SYM_NYT
        lastch = SYM_EOS
        literal_mode = False
        bc = 0
        bb = 0
        cn = self.state[SYM_NYT].root
        ba = []
        for ch in ss:
            bits = ord(ch)
            bitoff = (1 << 7)
            while bitoff > 0:
                bit = 0
                if (bitoff & bits):
                    bit = 1
                bitoff >>= 1
                if literal_mode:
                    bb <<= 1
                    bb += bit
                    bc += 1
                    if bc == 8:
                        literal_mode = False
                        if bb == SYM_EOS:
                            return self.out
                        self.out += chr(bb)
                        print "Literal done",`self.out`
                        self.state[SYM_NYT].update(bb)
                        if lastch > 0:
                            self.state[lastch].update(bb)
                        bc = 0
                        ba = []
                        last = bb
                        lastch = bb
                        cn = self.state[lastch].root
                        if self.state[lastch].number == 1:
                            #print "(Assuming NYT)"
                            last = SYM_NYT
                            cn = self.state[SYM_NYT].root
                else:
                    print "Bit is",`bit`,`self.out`,`cn.sym`
                    bc += 1
                    cn = cn.trace(bit)
                    ba.append(bit)
                    print "Move to",`cn.sym`
                    if cn.sym is not None:
                        #print "Have symbol"
                        print "Decoded %d bits" % bc, `ba`
                        bc = 0
                        ba = []
                        if cn.sym == SYM_EOS or cn.sym == SYM_FLUSH:
                            if cn.sym == SYM_EOS:
                                print "Found EOS"
                            else:
                                print "Found FLUSH"
                            return self.out
                        if cn.sym == SYM_NYT and last == SYM_NYT:
                            #print "Double NYT - literal mode"
                            bb = 0
                            bc = 0
                            literal_mode = True
                        else:
                            #print "Actual symbol"
                            if cn.sym > 0:
                                print "Printable symbol", chr(cn.sym)
                                self.out += chr(cn.sym)
                                print "Update",`lastch`,"for",`cn.sym`
                                self.state[lastch].update(cn.sym)
                                lastch = cn.sym
                            elif cn.sym == SYM_NYT:
                                print "Explicit NYT"
                            last = cn.sym
                            #print "Reset root"
                            cn = self.state[last].root
                            #print "Tree count:",`self.state[last].number`
                            if self.state[last].number == 1:
                                #print "Therefore Assuming NYT"
                                last = SYM_NYT
                                cn = self.state[SYM_NYT].root
                                #print "Tree count:",`self.state[SYM_NYT].number`
        print "ERROR"

if __name__ == '__main__':
    import sys
    import zlib

    # Check output routines.
    comp = core()
    comp.output_bits([0])
    comp.output_octet(ord('T'))
    comp.output_bits([1])
    comp.output_octet(ord('e'))
    comp.output_bits([1])
    comp.output_octet(ord('s'))
    comp.output_bits([1])
    comp.output_octet(ord('t'))
    comp.output_bits([0])
    comp.output_flush()
    
    print `comp.out`
    
    # Now check tree.
    print "\n**START\n"
    t = tree('0')
    print
    t.print_tree()
    print
    t.update(ord('a'))
    t.print_tree()
    print
    t.update(ord('a'))
    t.print_tree()
    print
    t.update(ord('r'))
    t.print_tree()
    print
    t.update(ord('d'))
    t.print_tree()
    print
    t.update(ord('v'))
    t.print_tree()
    print
    print
    print

    # Encode and decode an empty string.
    comp = core()
    print `comp.state[SYM_NYT].find(SYM_EOS)`
    print `comp.state[SYM_NYT].find(SYM_NYT)`
    comp.encode_string('')

    print `comp.out`
    print
    
    decomp = core()
    decomp.decode(comp.out)
    
    print `decomp.out`
    print

    # Same with "Test"
    comp = core()
    comp.encode_string('Test')
    
    print `comp.out`, 4, len(comp.out)
    print

    decomp = core()
    decomp.decode(comp.out)
    
    print `decomp.out`
    print

    # Longer string with repeats,
    # this exercises the adaptive 2nd order trees a bit.
    comp = core()
    comp.encode_string('TestTestTestTest')

    print `comp.out`, 12, len(comp.out)
    decomp = core()
    decomp.decode(comp.out)
    
    print `decomp.out`
    print
    
    # Now for real tests.
    # The following XML is used for all tests.
    test_xml = '''<?xml version="1.0"?>
<stream:stream>
<presence to='j...@jabber.org'><status>Here</status></presence>
<presence to='p...@jabber.org'><status>Here</status></presence>
<presence to='geo...@jabber.org'><status>Here</status></presence>
<presence to='ri...@jabber.org'><status>Here</status></presence>
</stream:stream>
'''
    
    comp = core()
    comp.encode_string(test_xml)
    
    test_xml_string = comp.out
    
    comp = core()
    comp.encode_xml(test_xml)
    
    test_xml_xml = comp.out
    
    comp = core(True)
    comp.encode_xml(test_xml)
    
    test_xml_xmld = comp.out
    
    test_xml_deflate = zlib.compress(test_xml, 9)
    
    print `len(test_xml)`, `len(test_xml_string)`, `len(test_xml_xml)`, `len(test_xml_xmld)`, `len(test_xml_deflate)`
    print "%02.2f %02.2f %02.2f %02.2f %02.2f" % (len(test_xml) * 100.0 / len(test_xml), len(test_xml_string) * 100.0 / len(test_xml), len(test_xml_xml) * 100.0 / len(test_xml), len(test_xml_xmld) * 100.0 / len(test_xml), len(test_xml_deflate) * 100.0 / len(test_xml))
    print "%02.2f %02.2f %02.2f %02.2f %02.2f" % (len(test_xml) * 100.0 / len(test_xml_string), len(test_xml_string) * 100.0 / len(test_xml_string), len(test_xml_xml) * 100.0 / len(test_xml_string), len(test_xml_xmld) * 100.0 / len(test_xml_string), len(test_xml_deflate) * 100.0 / len(test_xml_string))
    print "%02.2f %02.2f %02.2f %02.2f %02.2f" % (len(test_xml) * 100.0 / len(test_xml_xml), len(test_xml_string) * 100.0 / len(test_xml_xml), len(test_xml_xml) * 100.0 / len(test_xml_xml), len(test_xml_xmld) * 100.0 / len(test_xml_xml), len(test_xml_deflate) * 100.0 / len(test_xml_xml))

Reply via email to