I figured I talk enough about Patricia that I would benefit from experimenting with it to improve my understanding of it. So here's a (somewhat simplified) version of the Patricia algorithm in Python. Normal Patricia implementations work differently in the following ways:
- they index suffixes of a larger string and therefore store pointers into the string rather than having the string stuck there. - they compress the storage of the bit indices (and support longer strings) by storing only differences between successive bit indices in the nodes. - they reduce allocation overhead by storing the terminal nodes inside other nodes in the tree. - they only traverse the tree once to insert a new string, rather than twice. - less significantly, they usually count bits in big-endian order, making alphabetically nearby strings likely to be close together in the tree. Here are profiling results for inserting each line of /usr/share/dict/words on my 366MHz PII: >>> profile.run("for word in words[1:]: insert_new_string(wordtree, word)") 1592123 function calls in 139.860 CPU seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 1.950 1.950 139.850 139.850 <string>:1(?) 45403 27.850 0.001 53.360 0.001 pat.py:44(patricia_search) 1365106 59.710 0.000 59.710 0.000 pat.py:6(getbit) 45403 3.820 0.000 5.810 0.000 pat.py:61(insert_new_string_at) 45403 11.560 0.000 20.850 0.000 pat.py:71(first_bit_difference) 45403 25.960 0.001 48.880 0.001 pat.py:79(node_that_skipped_it) 45403 9.000 0.000 137.900 0.003 pat.py:87(insert_new_string) 1 0.010 0.010 139.860 139.860 profile:0(for word in words[1:]: insert_new_string(wordtree, word)) 0 0.000 0.000 profile:0(profiler) The Python process was using 7.8MB more memory than immediately after its startup after building this tree, or about 175 bytes per word; and that it was able to insert 325 words per second, moderately down from an earlier test in which it inserted 370 words per second for 1.89 seconds (700 words, up to a total of 1000.) Note that this is roughly one million instructions per string inserted, and this is a fairly small number of strings. This is why the above enhancements to the algorithm are desirable and why Python is not a desirable language in which to implement high-performance algorithms. To look at it another way, these million instructions amount to about 790 times the function-call overhead in Python on my machine. The work per string should increase as the log2 of the number of strings; for the above two tests, that was about 9 and about 15, which predicts a larger slowdown for the 40K-word case than I actually saw. Here's a quick cheap-junk spell-check program that uses the Patricia index of /usr/share/dict/words to find "possible misspellings" in the above couple of paragraphs, and its rather amusing results: >>> for word in string.split(paragraphs): ... word += "\n" ... if not lu(word) == word: print `word`, `lu(word)` ... 'Note\n' 'Nottingham\n' 'inserted,\n' 'inserted\n' 'a\n' 'ajar\n' 'strings.\n' 'strings\n' 'This\n' 'Thiensville\n' 'Python\n' 'Pythagoras\n' 'a\n' 'ajar\n' 'high-performance\n' 'higher\n' 'algorithms.\n' 'algorithms\n' 'To\n' 'Tobago\n' 'way,\n' 'way\n' '790\n' 'wiping\n' 'function-call\n' 'functioned\n' 'Python\n' 'Pythagoras\n' 'machine.\n' 'machine\n' 'The\n' 'Thebes\n' 'log2\n' 'logjam\n' 'strings;\n' 'strings\n' 'tests,\n' 'tests\n' '9\n' 'yonder\n' '15,\n' 'queried\n' 'a\n' 'ajar\n' '40K-word\n' 'though\n' 'I\n' 'Izvestia\n' 'saw.\n' 'sawfish\n' (Yes, "a" and "I" are not in /usr/share/dict/words.) #!/usr/bin/python # patricia trees. trace_getbit = 0 def getbit(astring, index): "Get bit N from astring, little-endianly." if trace_getbit: print "getting bit %d from %s" % (index, `astring`) # When walking the tree looking for a key not yet in there, # we may find ourselves walking among nodes whose keys are longer # than our own. In an ordinary lookup situation, we could return # "not found" at this point, but if we're going to find a string # that matches "as closely as possible" so we can find the first # mismatched bit, it's important to find *something*. So we just # return zeroes. try: relevantchar = astring[index / 8] except IndexError: return 0 return not not (ord(relevantchar) & (1 << (index % 8))) assert getbit("A", 0) assert not getbit("A", 1) assert not getbit("A", 2) assert getbit("A", 6) assert not getbit("A", 5) assert getbit("a", 5) assert getbit("a", 6) assert not getbit(" a", 6) assert getbit(" a", 14) assert getbit(" A", 5) assert not getbit(" A", 13) # an internal Patricia treenode has a bit index and left and right # children. An external Patricia treenode has a string. Here we # represent internal nodes as [bitindex, left, right] lists and external # nodes as lists containing just the string. # Traditionally, each node contains only a count of bits by which to # advance, not an absolute bit index. That makes the algorithm # hairier, especially the update logic, but it makes it possible to # handle larger strings. sample_tree = [0, [1, ['hello'], ['boo']], [3, ['underwear'], ['mary']]] def patricia_search(patnode, astring): while 1: if len(patnode) == 1: return patnode[0] bitindex, left, right = patnode if getbit(astring, bitindex): patnode = right else: patnode = left def lu(word): return patricia_search(sample_tree, word) # So how do we construct such a tree from a list of strings? # Well, for no strings, there is no way. You lose. # For one string, it is just [string]. # For two strings, we start with [string1]. Then we do a search as above and # discover that string2 is not in the tree, so we figure out that the two # strings differ at bit N, so we need to insert a node into string1's search # path at bit N, with its other alternative pointing at [string2]. def insert_new_string_at(patnode, newstring, bitindex): copy = patnode[:] newnode = [newstring] if getbit(newstring, bitindex): patnode[:] = [bitindex, copy, newnode] else: patnode[:] = [bitindex, newnode, copy] # that gets called as # insert_new_string_at(nodethatskippedit, newstring, # bitindexofdifference) def first_bit_difference(string1, string2): for charindex in xrange(min(len(string1), len(string2))): if string1[charindex] != string2[charindex]: bitstart = charindex * 8 for bitindex in range(bitstart, bitstart + 8): if getbit(string1, bitindex) != getbit(string2, bitindex): return bitindex def node_that_skipped_it(patnode, newstring, difference_index): while 1: if len(patnode) == 1: return patnode bitindex, left, right = patnode if bitindex > difference_index: return patnode if getbit(newstring, bitindex): patnode = right else: patnode = left def insert_new_string(patnode, newstring): bitindex = first_bit_difference(newstring, patricia_search(patnode, newstring)) insert_new_string_at(node_that_skipped_it(patnode, newstring, bitindex), newstring, bitindex) -- <[EMAIL PROTECTED]> Kragen Sitaker <http://www.pobox.com/~kragen/> To forget the evil in oneself is to turn one's own good -- now untethered from modesty and rendered tyrannical -- into a magnified power for evil. -- Steve Talbot, NETFUTURE #129, via SMART Letter