So, I've been playing with compression algorithms, by way of fun and
games.
In particular, I've built a 2nd-order adaptive Huffman encoder in
Python. It's not designed to be particularly efficient, but it's
possible to build one of these in about 64k or so, I think - that's
considerably less than the 256k that a DEFLATE compressor uses. (A
zlib-based codec costs 300k total at maximum compression, so it's not
equal, whereas this would be.)
The simple explanation of how it works is that it picks an encoding
based on the probability so far of the symbol occuring given the last
one. More probable symbols gain shorter codes, and the most probable
usually ends up with a 1- or 2-bit code. Unknown symbols generate
additional codes, and if they've never occured before at all, get
dumped literally.
Thus "Test" compresses from 4 to 6 - actually encoded as SYM_NYT "T"
SYM_NYT "e" ... SYM_FLUSH, so there's 9 symbols in there.
"TestTestTestTest" on the other hand compresses from 12 to 9 -
actually those last three "Test" strings compress to 1-bit per symbol
each, because the compressor has learnt the notion.
Now, there is a point to all this, shocking as it may seem. What I
wanted to do was examine what happens if we change the symbol-set we
input into the compression engine, so (and you'll see this at the end
of the attached Python code) I wrote out a simple bit of XMPP -
actually a misleadingly simple bit.
A DEFLATE compressor should do well on this, because there's lots of
repeated substrings.
My Huffman encoder, working on the string itself, brings it from 314
to 165 - that's 52.55% - and DEFLATE manages to get 36.94%. So, the
moral of the story here is that, on data which in principally textual
in nature, like XML, you don't want to be using a low-order adaptive
Huffman encoder. But we all knew that anyway - dictionary compressors
are pretty spiffy on long repeated substrings, and DEFLATE uses a
secondary (1st order, non-adaptive) Huffman.
But what happens if - instead of compressing the data as a string,
the Huffman alphabet includes the events generated by parsing the XML?
When we compress "<presence/>" traditionally, we compress it using 11
symbols - one for each letter. This is handy, as it means it generic
for any text, but as it happens, we don't have text - we have XML,
and we can use that to our advantage.
So, instead, my compressor can compress it using symbols which
describe the XML rather than the textual representation of it - in
this case, amusingly, it'll make no difference, as it comes out as
SYM_STARTTAG, "presence", SYM_ENDTAG, SYM_CLOSETAG. It makes slightly
more difference if we compress "<presence></presence>", as the
string-based compressor has nearly double the symbols, whereas a
XML-based one has exactly the same.
The moment I do this, then I'm marginally beating DEFLATE - no mean
feat in half the memory, albeit it might just be a product of the
data.
I can also add in one more trick, which'll certainly increase the
memory consumption, but ought to gain some serious savings - if I
allow the compressor to identify certain strings - like element and
attribute names - that are likely to be repeated, and assign them
symbols, then in principle similar structures in XML should compress
exceedingly well.
Testing shows that at least on the one fragment I've tried, there's a
slim improvement - about 1.5% overall, or 3.7% relative to the
XML-based compression.
The memory would increase, but on an entity with several sessions,
these strings are likely to be shared between sessions, so I'm not
entirely convinced I've broken the bank quite yet on memory usage,
but it's certainly a more complex proposition to measure.
I'll adapt my script - and I make no apologies for how ugly the code
is or the quantity of obtuse debugging data it outputs - to read in
actual XMPP transcripts and see how it performs. Meanwhile, have fun.
(Oh, and no, I've not written the XML-based decompressor, yet.)
Dave.
--
Dave Cridland - mailto:d...@cridland.net - xmpp:d...@dave.cridland.net
- acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
- http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
# Input/output may be assumed to be structurally well-formed XML in UTF-8.
# On Tuesdays.
# (Summer only).
import xml.parsers.expat
SYM_NYT = 0
SYM_EOS = -1
SYM_FLUSH = -2
SYM_STARTTAG = -3
SYM_ATTRNAME = -4
SYM_ATTRVAL = -5
SYM_ENDTAG = -6
SYM_CLOSETAG = -7
SYM_DEFSYM = -8
SYM_MACRO_START = -100
class tree_node:
def __init__(self, ch, number):
self.sym = ch
if self.sym is not None and not isinstance(self.sym,int):
raise "Hell"
if self.sym is not None and self.sym != 0:
print "*create",`self.sym`
self.parent = None
self.left = None
self.right = None
self.weight = 0
self.number = number
def bump(self):
self.weight += 1
def set_right(self, t):
self.right = t
if t is not None:
t.parent = self
def set_left(self, t):
self.left = t
if t is not None:
t.parent = self
def swap(self, t):
tmp = self.sym
self.sym = t.sym
t.sym = tmp
tmp = self.left
self.set_left(t.left)
t.set_left(tmp)
tmp = self.right
self.set_right(t.right)
t.set_right(tmp)
def trace(self, bit):
if not bit:
return self.left
return self.right
class tree:
def __init__(self, sym):
self.number = 0
self.node_map = {}
self.nodes = {}
self.root = self.tree_node(SYM_NYT)
self.sym = sym
def tree_node(self, ch = None):
t = tree_node(ch, self.number)
self.nodes[self.number] = t
self.number += 1
if ch is not None:
if ch in self.node_map:
self.node_map[ch].sym = None
self.node_map[ch] = t
return t
def update_nyt(self, ch):
if ch == -2:
print "*here"
nyt = self.node_map[SYM_NYT]
nsym = self.tree_node(ch)
nnyt = self.tree_node(SYM_NYT)
nyt.set_left(nnyt)
nyt.set_right(nsym)
nsym.bump()
return nyt
def update(self, ch):
print "*update",`ch`,"in tree",`self.sym`
if ch not in self.node_map:
node = self.update_nyt(ch)
if node is self.root:
node.bump()
return
else:
node = self.node_map[ch]
while True:
self.print_tree()
print "*checking node",`node.number`
lowest_node = None
lowest_number = node.number
for k,v in self.nodes.items():
if v.weight == node.weight and v is not node.parent:
if k < lowest_number:
print "*equal weight and lower:",`v.number`
lowest_node = v
lowest_number = k
if lowest_node is not None:
print "*swap ",`lowest_node.number`,`lowest_node.sym`,`node.number`,`node.sym`
self.print_tree()
lowest_node.swap(node)
if lowest_node.sym is not None:
self.node_map[lowest_node.sym] = lowest_node
if node.sym is not None:
self.node_map[node.sym] = node
if node.parent is None:
self.root = node
print "*swapped",`lowest_node.sym`,`node.sym`
node = lowest_node
self.print_tree()
node.bump()
if node is self.root:
break
node = node.parent
self.print_tree()
def print_tree(self, node=None, count=0):
if node is None:
node = self.root
if count > 20:
return
print "%s%d <%d> [%s]" % (' ' * count, node.number, node.weight, `node.sym`)
if node.left:
self.print_tree(node.left, count+1)
if node.left.parent is not node:
print "ERROR!"
self.print_tree(node.right, count+1)
if node.right.parent is not node:
print "ERROR!"
def find(self,ch):
if ch not in self.node_map:
return self.find(SYM_NYT)
node = self.node_map[ch]
path = []
while node is not self.root:
if node.parent.left is node:
path.append(0)
else:
path.append(1)
node = node.parent
path.reverse()
return path
def is_nyt(self, ch):
if ch in self.node_map:
return False
return True
class core:
SYMS_LIKELY = 10
SYMS_UNLIKELY = 0
def __init__(self, dm=False):
self.define_macros = dm
self.state = []
self.total = 0
for sym in range(256):
self.state.append(tree(sym))
self.state[SYM_NYT].print_tree()
self.state[SYM_NYT].update(SYM_EOS)
self.state[SYM_NYT].print_tree()
self.state[SYM_NYT].update(SYM_FLUSH)
self.state[SYM_NYT].print_tree()
if self.define_macros:
self.state[SYM_NYT].update(SYM_DEFSYM)
self.state[SYM_NYT].update(SYM_NYT)
self.state[SYM_NYT].print_tree()
self.macros = {}
self.macro_next = SYM_MACRO_START
self.last = SYM_NYT
self.bitbuf = 0
self.bitoff = 0
self.out = ''
self.xml = None
def encode_sym(self, ch):
tree = self.state[self.last]
if tree.number == 1:
tmp = self.last
self.last = SYM_NYT
self.encode_sym(ch)
self.last = tmp
else:
bits = tree.find(ch)
if bits:
print "Encoding %d bits from tree %d" % (len(bits), tree.sym), `bits`
self.output_bits(bits)
cn = tree.root
for bit in bits:
cn = cn.trace(bit)
if cn.sym is not None:
print "*sym",`cn.sym`
if tree.is_nyt(ch):
if cn.sym != 0:
raise "Argh"
elif cn.sym != ch:
raise "Argh2"
else:
print "No bits to encode from tree",`tree.sym`
raise "No bits to encode?"
if tree.is_nyt(ch):
print "Encoding octet"
self.output_octet(ch)
self.state[self.last].update(ch)
self.last = ch
def encode_string(self, uu, last=False):
self.encode_octets(uu)
if last:
self.encode_sym(SYM_EOS)
else:
self.encode_sym(SYM_FLUSH)
self.output_flush()
return self.out
def encode_octets(self, uu):
for ch in uu:
sym = ord(ch)
if sym == 0:
break
self.encode_sym(sym)
def encode_name(self, name):
if not self.define_macros:
self.encode_octets(name)
return
if name not in self.macros:
print "Defining macro for",`name`
self.encode_sym(SYM_DEFSYM)
self.encode_octets(name)
m = self.macro_next
self.macro_next -= 1
self.macros[name] = m
self.state[SYM_NYT].update(m)
self.state[self.last].update(m)
else:
print "Using existing macro",`name`
self.encode_sym(self.macros[name])
def starttag(self, tag, attrs):
self.encode_sym(SYM_STARTTAG)
self.encode_name(tag)
for k,v in attrs.items():
self.encode_sym(SYM_ATTRNAME)
self.encode_name(k)
self.encode_sym(SYM_ATTRVAL)
self.encode_octets(v)
self.encode_sym(SYM_ENDTAG)
def endtag(self, tag):
self.encode_sym(SYM_CLOSETAG)
def cdata(self, data):
self.encode_octets(data)
def encode_xml(self, xmldata, last=False):
if self.xml is None:
self.xml = xml.parsers.expat.ParserCreate()
self.xml.StartElementHandler = self.starttag
self.xml.EndElementHandler = self.endtag
self.xml.CharacterDataHandler = self.cdata
self.state[SYM_NYT].update(SYM_STARTTAG)
self.state[SYM_NYT].update(SYM_ATTRNAME)
self.state[SYM_NYT].update(SYM_ATTRVAL)
self.state[SYM_NYT].update(SYM_ENDTAG)
self.state[SYM_NYT].update(SYM_CLOSETAG)
self.state[SYM_STARTTAG].update(SYM_DEFSYM)
self.state[SYM_ATTRNAME].update(SYM_DEFSYM)
self.xml.Parse(xmldata, 1)
if last:
self.encode_sym(SYM_EOS)
else:
self.encode_sym(SYM_FLUSH)
self.output_flush()
return self.out
def output_bits(self, bits):
for b in bits:
self.bitbuf <<= 1
self.bitoff += 1
self.bitbuf += b
if self.bitoff == 8:
self.out += chr(self.bitbuf)
self.bitoff = 0
self.bitbuf = 0
def output_flush(self):
while self.bitoff != 0:
self.output_bits([0])
def output_octet(self, ch):
path = []
while len(path) < 8:
path.append(ch & 1)
ch >>= 1
path.reverse()
self.output_bits(path)
def decode(self, ss):
last = SYM_NYT
lastch = SYM_EOS
literal_mode = False
bc = 0
bb = 0
cn = self.state[SYM_NYT].root
ba = []
for ch in ss:
bits = ord(ch)
bitoff = (1 << 7)
while bitoff > 0:
bit = 0
if (bitoff & bits):
bit = 1
bitoff >>= 1
if literal_mode:
bb <<= 1
bb += bit
bc += 1
if bc == 8:
literal_mode = False
if bb == SYM_EOS:
return self.out
self.out += chr(bb)
print "Literal done",`self.out`
self.state[SYM_NYT].update(bb)
if lastch > 0:
self.state[lastch].update(bb)
bc = 0
ba = []
last = bb
lastch = bb
cn = self.state[lastch].root
if self.state[lastch].number == 1:
#print "(Assuming NYT)"
last = SYM_NYT
cn = self.state[SYM_NYT].root
else:
print "Bit is",`bit`,`self.out`,`cn.sym`
bc += 1
cn = cn.trace(bit)
ba.append(bit)
print "Move to",`cn.sym`
if cn.sym is not None:
#print "Have symbol"
print "Decoded %d bits" % bc, `ba`
bc = 0
ba = []
if cn.sym == SYM_EOS or cn.sym == SYM_FLUSH:
if cn.sym == SYM_EOS:
print "Found EOS"
else:
print "Found FLUSH"
return self.out
if cn.sym == SYM_NYT and last == SYM_NYT:
#print "Double NYT - literal mode"
bb = 0
bc = 0
literal_mode = True
else:
#print "Actual symbol"
if cn.sym > 0:
print "Printable symbol", chr(cn.sym)
self.out += chr(cn.sym)
print "Update",`lastch`,"for",`cn.sym`
self.state[lastch].update(cn.sym)
lastch = cn.sym
elif cn.sym == SYM_NYT:
print "Explicit NYT"
last = cn.sym
#print "Reset root"
cn = self.state[last].root
#print "Tree count:",`self.state[last].number`
if self.state[last].number == 1:
#print "Therefore Assuming NYT"
last = SYM_NYT
cn = self.state[SYM_NYT].root
#print "Tree count:",`self.state[SYM_NYT].number`
print "ERROR"
if __name__ == '__main__':
import sys
import zlib
# Check output routines.
comp = core()
comp.output_bits([0])
comp.output_octet(ord('T'))
comp.output_bits([1])
comp.output_octet(ord('e'))
comp.output_bits([1])
comp.output_octet(ord('s'))
comp.output_bits([1])
comp.output_octet(ord('t'))
comp.output_bits([0])
comp.output_flush()
print `comp.out`
# Now check tree.
print "\n**START\n"
t = tree('0')
print
t.print_tree()
print
t.update(ord('a'))
t.print_tree()
print
t.update(ord('a'))
t.print_tree()
print
t.update(ord('r'))
t.print_tree()
print
t.update(ord('d'))
t.print_tree()
print
t.update(ord('v'))
t.print_tree()
print
print
print
# Encode and decode an empty string.
comp = core()
print `comp.state[SYM_NYT].find(SYM_EOS)`
print `comp.state[SYM_NYT].find(SYM_NYT)`
comp.encode_string('')
print `comp.out`
print
decomp = core()
decomp.decode(comp.out)
print `decomp.out`
print
# Same with "Test"
comp = core()
comp.encode_string('Test')
print `comp.out`, 4, len(comp.out)
print
decomp = core()
decomp.decode(comp.out)
print `decomp.out`
print
# Longer string with repeats,
# this exercises the adaptive 2nd order trees a bit.
comp = core()
comp.encode_string('TestTestTestTest')
print `comp.out`, 12, len(comp.out)
decomp = core()
decomp.decode(comp.out)
print `decomp.out`
print
# Now for real tests.
# The following XML is used for all tests.
test_xml = '''<?xml version="1.0"?>
<stream:stream>
<presence to='j...@jabber.org'><status>Here</status></presence>
<presence to='p...@jabber.org'><status>Here</status></presence>
<presence to='geo...@jabber.org'><status>Here</status></presence>
<presence to='ri...@jabber.org'><status>Here</status></presence>
</stream:stream>
'''
comp = core()
comp.encode_string(test_xml)
test_xml_string = comp.out
comp = core()
comp.encode_xml(test_xml)
test_xml_xml = comp.out
comp = core(True)
comp.encode_xml(test_xml)
test_xml_xmld = comp.out
test_xml_deflate = zlib.compress(test_xml, 9)
print `len(test_xml)`, `len(test_xml_string)`, `len(test_xml_xml)`, `len(test_xml_xmld)`, `len(test_xml_deflate)`
print "%02.2f %02.2f %02.2f %02.2f %02.2f" % (len(test_xml) * 100.0 / len(test_xml), len(test_xml_string) * 100.0 / len(test_xml), len(test_xml_xml) * 100.0 / len(test_xml), len(test_xml_xmld) * 100.0 / len(test_xml), len(test_xml_deflate) * 100.0 / len(test_xml))
print "%02.2f %02.2f %02.2f %02.2f %02.2f" % (len(test_xml) * 100.0 / len(test_xml_string), len(test_xml_string) * 100.0 / len(test_xml_string), len(test_xml_xml) * 100.0 / len(test_xml_string), len(test_xml_xmld) * 100.0 / len(test_xml_string), len(test_xml_deflate) * 100.0 / len(test_xml_string))
print "%02.2f %02.2f %02.2f %02.2f %02.2f" % (len(test_xml) * 100.0 / len(test_xml_xml), len(test_xml_string) * 100.0 / len(test_xml_xml), len(test_xml_xml) * 100.0 / len(test_xml_xml), len(test_xml_xmld) * 100.0 / len(test_xml_xml), len(test_xml_deflate) * 100.0 / len(test_xml_xml))