[Standards] Fun and games with Huffman

Dave Cridland Tue, 04 Aug 2009 02:02:12 -0700

So, I've been playing with compression algorithms, by way of fun andgames.

In particular, I've built a 2nd-order adaptive Huffman encoder inPython. It's not designed to be particularly efficient, but it'spossible to build one of these in about 64k or so, I think - that'sconsiderably less than the 256k that a DEFLATE compressor uses. (Azlib-based codec costs 300k total at maximum compression, so it's notequal, whereas this would be.)

The simple explanation of how it works is that it picks an encodingbased on the probability so far of the symbol occuring given the lastone. More probable symbols gain shorter codes, and the most probableusually ends up with a 1- or 2-bit code. Unknown symbols generateadditional codes, and if they've never occured before at all, getdumped literally.

Thus "Test" compresses from 4 to 6 - actually encoded as SYM_NYT "T"SYM_NYT "e" ... SYM_FLUSH, so there's 9 symbols in there.

"TestTestTestTest" on the other hand compresses from 12 to 9 -actually those last three "Test" strings compress to 1-bit per symboleach, because the compressor has learnt the notion.

Now, there is a point to all this, shocking as it may seem. What Iwanted to do was examine what happens if we change the symbol-set weinput into the compression engine, so (and you'll see this at the endof the attached Python code) I wrote out a simple bit of XMPP -actually a misleadingly simple bit.

A DEFLATE compressor should do well on this, because there's lots ofrepeated substrings.

My Huffman encoder, working on the string itself, brings it from 314to 165 - that's 52.55% - and DEFLATE manages to get 36.94%. So, themoral of the story here is that, on data which in principally textualin nature, like XML, you don't want to be using a low-order adaptiveHuffman encoder. But we all knew that anyway - dictionary compressorsare pretty spiffy on long repeated substrings, and DEFLATE uses asecondary (1st order, non-adaptive) Huffman.

But what happens if - instead of compressing the data as a string,the Huffman alphabet includes the events generated by parsing the XML?

When we compress "<presence/>" traditionally, we compress it using 11symbols - one for each letter. This is handy, as it means it genericfor any text, but as it happens, we don't have text - we have XML,and we can use that to our advantage.

So, instead, my compressor can compress it using symbols whichdescribe the XML rather than the textual representation of it - inthis case, amusingly, it'll make no difference, as it comes out asSYM_STARTTAG, "presence", SYM_ENDTAG, SYM_CLOSETAG. It makes slightlymore difference if we compress "<presence></presence>", as thestring-based compressor has nearly double the symbols, whereas aXML-based one has exactly the same.

The moment I do this, then I'm marginally beating DEFLATE - no meanfeat in half the memory, albeit it might just be a product of thedata.

I can also add in one more trick, which'll certainly increase thememory consumption, but ought to gain some serious savings - if Iallow the compressor to identify certain strings - like element andattribute names - that are likely to be repeated, and assign themsymbols, then in principle similar structures in XML should compressexceedingly well.

Testing shows that at least on the one fragment I've tried, there's aslim improvement - about 1.5% overall, or 3.7% relative to theXML-based compression.

The memory would increase, but on an entity with several sessions,these strings are likely to be shared between sessions, so I'm notentirely convinced I've broken the bank quite yet on memory usage,but it's certainly a more complex proposition to measure.

I'll adapt my script - and I make no apologies for how ugly the codeis or the quantity of obtuse debugging data it outputs - to read inactual XMPP transcripts and see how it performs. Meanwhile, have fun.


(Oh, and no, I've not written the XML-based decompressor, yet.)

Dave.
--
Dave Cridland - mailto:d...@cridland.net - xmpp:d...@dave.cridland.net
 - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
 - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade

# Input/output may be assumed to be structurally well-formed XML in UTF-8.
# On Tuesdays.
# (Summer only).

import xml.parsers.expat

SYM_NYT = 0
SYM_EOS = -1
SYM_FLUSH = -2

SYM_STARTTAG = -3
SYM_ATTRNAME = -4
SYM_ATTRVAL = -5
SYM_ENDTAG = -6
SYM_CLOSETAG = -7

SYM_DEFSYM = -8

SYM_MACRO_START = -100

class tree_node:
    def __init__(self, ch, number):
        self.sym = ch
        if self.sym is not None and not isinstance(self.sym,int):
            raise "Hell"
        if self.sym is not None and self.sym != 0:
            print "*create",`self.sym`
        self.parent = None
        self.left = None
        self.right = None
        self.weight = 0
        self.number = number
        
    def bump(self):
        self.weight += 1
            
    def set_right(self, t):
        self.right = t
        if t is not None:
            t.parent = self
        
    def set_left(self, t):
        self.left = t
        if t is not None:
            t.parent = self
        
    def swap(self, t):
        tmp = self.sym
        self.sym = t.sym
        t.sym = tmp
        
        tmp = self.left
        self.set_left(t.left)
        t.set_left(tmp)
        
        tmp = self.right
        self.set_right(t.right)
        t.set_right(tmp)
        
    def trace(self, bit):
        if not bit:
            return self.left
        return self.right

class tree:
    def __init__(self, sym):
        self.number = 0
        self.node_map = {}
        self.nodes = {}
        self.root = self.tree_node(SYM_NYT)
        self.sym = sym

    def tree_node(self, ch = None):
        t = tree_node(ch, self.number)
        self.nodes[self.number] = t
        self.number += 1
        if ch is not None:
            if ch in self.node_map:
                self.node_map[ch].sym = None
            self.node_map[ch] = t
        return t
            
    def update_nyt(self, ch):
        if ch == -2:
            print "*here"
        nyt = self.node_map[SYM_NYT]
        nsym = self.tree_node(ch)
        nnyt = self.tree_node(SYM_NYT)
        nyt.set_left(nnyt)
        nyt.set_right(nsym)
        nsym.bump()
        return nyt

    def update(self, ch):
        print "*update",`ch`,"in tree",`self.sym`
        if ch not in self.node_map:
            node = self.update_nyt(ch)
            if node is self.root:
                node.bump()
                return
        else:
            node = self.node_map[ch]
        while True:
            self.print_tree()
            print "*checking node",`node.number`
            lowest_node = None
            lowest_number = node.number
            for k,v in self.nodes.items():
                if v.weight == node.weight and v is not node.parent:
                    if k < lowest_number:
                        print "*equal weight and lower:",`v.number`
                        lowest_node = v
                        lowest_number = k
            if lowest_node is not None:
                print "*swap   ",`lowest_node.number`,`lowest_node.sym`,`node.number`,`node.sym`
                self.print_tree()
                lowest_node.swap(node)
                if lowest_node.sym is not None:
                    self.node_map[lowest_node.sym] = lowest_node
                if node.sym is not None:
                    self.node_map[node.sym] = node
                if node.parent is None:
                    self.root = node
                print "*swapped",`lowest_node.sym`,`node.sym`
                node = lowest_node
                self.print_tree()
            node.bump()
            if node is self.root:
                break
            node = node.parent
        self.print_tree()
    
    def print_tree(self, node=None, count=0):
        if node is None:
            node = self.root
        if count > 20:
            return
        print "%s%d <%d> [%s]" % (' ' * count, node.number, node.weight, `node.sym`)
        if node.left:
            self.print_tree(node.left, count+1)
            if node.left.parent is not node:
                print "ERROR!"
            self.print_tree(node.right, count+1)            
            if node.right.parent is not node:
                print "ERROR!"
            
    def find(self,ch):
        if ch not in self.node_map:
            return self.find(SYM_NYT)
        node = self.node_map[ch]
        path = []
        while node is not self.root:
            if node.parent.left is node:
                path.append(0)
            else:
                path.append(1)
            node = node.parent
        path.reverse()
        return path
    
    def is_nyt(self, ch):
        if ch in self.node_map:
            return False
        return True

class core:
    SYMS_LIKELY = 10
    SYMS_UNLIKELY = 0
    
    def __init__(self, dm=False):
        self.define_macros = dm
        self.state = []
        self.total = 0
        for sym in range(256):
            self.state.append(tree(sym))
        self.state[SYM_NYT].print_tree()
        self.state[SYM_NYT].update(SYM_EOS)
        self.state[SYM_NYT].print_tree()
        self.state[SYM_NYT].update(SYM_FLUSH)
        self.state[SYM_NYT].print_tree()
        if self.define_macros:
            self.state[SYM_NYT].update(SYM_DEFSYM)
        self.state[SYM_NYT].update(SYM_NYT)
        self.state[SYM_NYT].print_tree()
        self.macros = {}
        self.macro_next = SYM_MACRO_START
        self.last = SYM_NYT
        self.bitbuf = 0
        self.bitoff = 0
        self.out = ''
        self.xml = None
    
    def encode_sym(self, ch):
        tree = self.state[self.last]
        if tree.number == 1:
            tmp = self.last
            self.last = SYM_NYT
            self.encode_sym(ch)
            self.last = tmp
        else:
            bits = tree.find(ch)
            if bits:
                print "Encoding %d bits from tree %d" % (len(bits), tree.sym), `bits`
                self.output_bits(bits)
                cn = tree.root
                for bit in bits:
                    cn = cn.trace(bit)
                    if cn.sym is not None:
                        print "*sym",`cn.sym`
                if tree.is_nyt(ch):
                    if cn.sym != 0:
                        raise "Argh"
                elif cn.sym != ch:
                    raise "Argh2"
            else:
                print "No bits to encode from tree",`tree.sym`
                raise "No bits to encode?"
            if tree.is_nyt(ch):
                print "Encoding octet"
                self.output_octet(ch)
        self.state[self.last].update(ch)
        self.last = ch
    
    def encode_string(self, uu, last=False):
        self.encode_octets(uu)
        if last:
            self.encode_sym(SYM_EOS)
        else:
            self.encode_sym(SYM_FLUSH)
        self.output_flush()
        return self.out
    
    def encode_octets(self, uu):
        for ch in uu:
            sym = ord(ch)
            if sym == 0:
                break
            self.encode_sym(sym)
    
    def encode_name(self, name):
        if not self.define_macros:
            self.encode_octets(name)
            return
        if name not in self.macros:
            print "Defining macro for",`name`
            self.encode_sym(SYM_DEFSYM)
            self.encode_octets(name)
            m = self.macro_next
            self.macro_next -= 1
            self.macros[name] = m
            self.state[SYM_NYT].update(m)
            self.state[self.last].update(m)
        else:
            print "Using existing macro",`name`
            self.encode_sym(self.macros[name])
    
    def starttag(self, tag, attrs):
        self.encode_sym(SYM_STARTTAG)
        self.encode_name(tag)
        for k,v in attrs.items():
            self.encode_sym(SYM_ATTRNAME)
            self.encode_name(k)
            self.encode_sym(SYM_ATTRVAL)
            self.encode_octets(v)
        self.encode_sym(SYM_ENDTAG)
        
    def endtag(self, tag):
        self.encode_sym(SYM_CLOSETAG)
    
    def cdata(self, data):
        self.encode_octets(data)
    
    def encode_xml(self, xmldata, last=False):
        if self.xml is None:
            self.xml = xml.parsers.expat.ParserCreate()
            self.xml.StartElementHandler = self.starttag
            self.xml.EndElementHandler = self.endtag
            self.xml.CharacterDataHandler = self.cdata
            self.state[SYM_NYT].update(SYM_STARTTAG)
            self.state[SYM_NYT].update(SYM_ATTRNAME)
            self.state[SYM_NYT].update(SYM_ATTRVAL)
            self.state[SYM_NYT].update(SYM_ENDTAG)
            self.state[SYM_NYT].update(SYM_CLOSETAG)
            self.state[SYM_STARTTAG].update(SYM_DEFSYM)
            self.state[SYM_ATTRNAME].update(SYM_DEFSYM)
        self.xml.Parse(xmldata, 1)
        if last:
            self.encode_sym(SYM_EOS)
        else:
            self.encode_sym(SYM_FLUSH)
        self.output_flush()
        return self.out        
    
    def output_bits(self, bits):
        for b in bits:
            self.bitbuf <<= 1
            self.bitoff += 1
            self.bitbuf += b
            if self.bitoff == 8:
                self.out += chr(self.bitbuf)
                self.bitoff = 0
                self.bitbuf = 0
    
    def output_flush(self):
        while self.bitoff != 0:
            self.output_bits([0])
    
    def output_octet(self, ch):
        path = []
        while len(path) < 8:
            path.append(ch & 1)
            ch >>= 1
        path.reverse()
        self.output_bits(path)
        
    def decode(self, ss):
        last = SYM_NYT
        lastch = SYM_EOS
        literal_mode = False
        bc = 0
        bb = 0
        cn = self.state[SYM_NYT].root
        ba = []
        for ch in ss:
            bits = ord(ch)
            bitoff = (1 << 7)
            while bitoff > 0:
                bit = 0
                if (bitoff & bits):
                    bit = 1
                bitoff >>= 1
                if literal_mode:
                    bb <<= 1
                    bb += bit
                    bc += 1
                    if bc == 8:
                        literal_mode = False
                        if bb == SYM_EOS:
                            return self.out
                        self.out += chr(bb)
                        print "Literal done",`self.out`
                        self.state[SYM_NYT].update(bb)
                        if lastch > 0:
                            self.state[lastch].update(bb)
                        bc = 0
                        ba = []
                        last = bb
                        lastch = bb
                        cn = self.state[lastch].root
                        if self.state[lastch].number == 1:
                            #print "(Assuming NYT)"
                            last = SYM_NYT
                            cn = self.state[SYM_NYT].root
                else:
                    print "Bit is",`bit`,`self.out`,`cn.sym`
                    bc += 1
                    cn = cn.trace(bit)
                    ba.append(bit)
                    print "Move to",`cn.sym`
                    if cn.sym is not None:
                        #print "Have symbol"
                        print "Decoded %d bits" % bc, `ba`
                        bc = 0
                        ba = []
                        if cn.sym == SYM_EOS or cn.sym == SYM_FLUSH:
                            if cn.sym == SYM_EOS:
                                print "Found EOS"
                            else:
                                print "Found FLUSH"
                            return self.out
                        if cn.sym == SYM_NYT and last == SYM_NYT:
                            #print "Double NYT - literal mode"
                            bb = 0
                            bc = 0
                            literal_mode = True
                        else:
                            #print "Actual symbol"
                            if cn.sym > 0:
                                print "Printable symbol", chr(cn.sym)
                                self.out += chr(cn.sym)
                                print "Update",`lastch`,"for",`cn.sym`
                                self.state[lastch].update(cn.sym)
                                lastch = cn.sym
                            elif cn.sym == SYM_NYT:
                                print "Explicit NYT"
                            last = cn.sym
                            #print "Reset root"
                            cn = self.state[last].root
                            #print "Tree count:",`self.state[last].number`
                            if self.state[last].number == 1:
                                #print "Therefore Assuming NYT"
                                last = SYM_NYT
                                cn = self.state[SYM_NYT].root
                                #print "Tree count:",`self.state[SYM_NYT].number`
        print "ERROR"

if __name__ == '__main__':
    import sys
    import zlib

    # Check output routines.
    comp = core()
    comp.output_bits([0])
    comp.output_octet(ord('T'))
    comp.output_bits([1])
    comp.output_octet(ord('e'))
    comp.output_bits([1])
    comp.output_octet(ord('s'))
    comp.output_bits([1])
    comp.output_octet(ord('t'))
    comp.output_bits([0])
    comp.output_flush()
    
    print `comp.out`
    
    # Now check tree.
    print "\n**START\n"
    t = tree('0')
    print
    t.print_tree()
    print
    t.update(ord('a'))
    t.print_tree()
    print
    t.update(ord('a'))
    t.print_tree()
    print
    t.update(ord('r'))
    t.print_tree()
    print
    t.update(ord('d'))
    t.print_tree()
    print
    t.update(ord('v'))
    t.print_tree()
    print
    print
    print

    # Encode and decode an empty string.
    comp = core()
    print `comp.state[SYM_NYT].find(SYM_EOS)`
    print `comp.state[SYM_NYT].find(SYM_NYT)`
    comp.encode_string('')

    print `comp.out`
    print
    
    decomp = core()
    decomp.decode(comp.out)
    
    print `decomp.out`
    print

    # Same with "Test"
    comp = core()
    comp.encode_string('Test')
    
    print `comp.out`, 4, len(comp.out)
    print

    decomp = core()
    decomp.decode(comp.out)
    
    print `decomp.out`
    print

    # Longer string with repeats,
    # this exercises the adaptive 2nd order trees a bit.
    comp = core()
    comp.encode_string('TestTestTestTest')

    print `comp.out`, 12, len(comp.out)
    decomp = core()
    decomp.decode(comp.out)
    
    print `decomp.out`
    print
    
    # Now for real tests.
    # The following XML is used for all tests.
    test_xml = '''<?xml version="1.0"?>
<stream:stream>
<presence to='j...@jabber.org'><status>Here</status></presence>
<presence to='p...@jabber.org'><status>Here</status></presence>
<presence to='geo...@jabber.org'><status>Here</status></presence>
<presence to='ri...@jabber.org'><status>Here</status></presence>
</stream:stream>
'''
    
    comp = core()
    comp.encode_string(test_xml)
    
    test_xml_string = comp.out
    
    comp = core()
    comp.encode_xml(test_xml)
    
    test_xml_xml = comp.out
    
    comp = core(True)
    comp.encode_xml(test_xml)
    
    test_xml_xmld = comp.out
    
    test_xml_deflate = zlib.compress(test_xml, 9)
    
    print `len(test_xml)`, `len(test_xml_string)`, `len(test_xml_xml)`, `len(test_xml_xmld)`, `len(test_xml_deflate)`
    print "%02.2f %02.2f %02.2f %02.2f %02.2f" % (len(test_xml) * 100.0 / len(test_xml), len(test_xml_string) * 100.0 / len(test_xml), len(test_xml_xml) * 100.0 / len(test_xml), len(test_xml_xmld) * 100.0 / len(test_xml), len(test_xml_deflate) * 100.0 / len(test_xml))
    print "%02.2f %02.2f %02.2f %02.2f %02.2f" % (len(test_xml) * 100.0 / len(test_xml_string), len(test_xml_string) * 100.0 / len(test_xml_string), len(test_xml_xml) * 100.0 / len(test_xml_string), len(test_xml_xmld) * 100.0 / len(test_xml_string), len(test_xml_deflate) * 100.0 / len(test_xml_string))
    print "%02.2f %02.2f %02.2f %02.2f %02.2f" % (len(test_xml) * 100.0 / len(test_xml_xml), len(test_xml_string) * 100.0 / len(test_xml_xml), len(test_xml_xml) * 100.0 / len(test_xml_xml), len(test_xml_xmld) * 100.0 / len(test_xml_xml), len(test_xml_deflate) * 100.0 / len(test_xml_xml))

[Standards] Fun and games with Huffman

Reply via email to