Herewith a passel of algorithmic musings, all untested and all uninformed by previous work: some approaches to making fuzzy file signatures that let you quickly find similar files, specifically involving hashing, vector math, substrings of various sizes, bitvectors, Bloom filters, emphasizing more unusual strings, and discovering which strings are unusual in a corpus.
If one of these approaches works, it could be useful for sorting blobs in git packs to improve compression, for inferring cladistic trees from genome data, for filtering spam, for detecting viruses, for discovering copyright violation and plagiarism, for network intrusion detection, for forensic examination of disks, and for maintaining a link to, say, a paragraph within a document that is being edited (as in Queer Numbers).

The previous work I know about in this area is spamsum, CTPH, FTK, and ssdeep.  They use a different approach.  I don't know if this approach is likely to be better or worse.  It certainly ought to be able to find much looser similarities between files (e.g. emails written by different people in the same language).

Fuzzy file signatures for fuzzy "hashing"
-----------------------------------------

11 years ago, I tried this failed experiment:
<http://lists.canonical.org/pipermail/kragen-hacks/1999-February/000157.html>

In quick summary, the idea was to compute a compact "signature" of a file that could be compared against another file's "signature" to see whether the files were not only identical, but similar, by hashing many substrings of each file and comparing the counts in the hash tables.  In pseudocode:

    for each file f:
        hashtable[f] = [0 for i = 0 to tablesize]
        for i = 0 to f.length - n:
            bucket = hash(f.contents[i : i+n]) % tablesize
            hashtable[f][bucket]++

    magnitude = { f: sqrt(sum(val**2 for val in hashtable[f]))
                  for each file f }

    for each file f:
        for each file g:
            normalization = magnitude[f] * magnitude[g]
            dotproduct = sum(hashtable[f][i] * hashtable[g][i]
                             for i = 0 to tablesize)
            similarity[f, g] = dotproduct / normalization

It didn't work because big files with small hash tables tended to push all the hash buckets toward being more or less equal.

I was lying awake at night listening to the wind, and some ideas for improving the approach came to me.
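To make the pseudocode concrete, here is a runnable (but equally untested-in-anger) Python sketch of the same scheme; `ngram_signature`, `n`, and `tablesize` are names I'm inventing here, and zlib's crc32 merely stands in for whatever hash you prefer:

```python
import math
import zlib

def ngram_signature(data, n=4, tablesize=64):
    """Count hashed n-byte substrings of data into a fixed-size table."""
    table = [0] * tablesize
    for i in range(len(data) - n + 1):
        bucket = zlib.crc32(data[i:i + n]) % tablesize
        table[bucket] += 1
    return table

def cosine_similarity(a, b):
    """Dot product of two signature vectors, normalized by magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0
```

Two revisions of the same text should score well above two unrelated files --- and that's precisely the property that decays as files grow much larger than the table.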
Both positive and negative hashes
---------------------------------

The basic problem of bias toward big files can be solved by decrementing the buckets for some hash strings and incrementing them for others.  Instead of

    bucket = hash(f.contents[i : i+n]) % tablesize
    hashtable[f][bucket]++

we have something like

    bucket = hash(f.contents[i : i+n]) % (2 * tablesize)
    hashtable[f][bucket % tablesize] += (1 if bucket < tablesize else -1)

This should make the directions of the file vectors distribute nicely all around the hypersphere, which should make the normalized-dot-product approach less ineffective.

Various sizes of substrings
---------------------------

If two files are different points in time of the same text file, they'll probably have lots of 128-byte chunks in common, not just 16-byte chunks.  On the other hand, if two files are merely written in the same natural language (and character encoding), they'll have hardly any 128-byte chunks in common, but probably lots of 8-byte chunks.  But looking only at 8-byte chunks, particularly hashed, is insufficient for identifying two files as different versions of the same file --- at that granularity, they'll hardly look any more similar than any other two random files from the same language.  So a *general-purpose* file similarity checker might use several different sizes of substring: 4-byte, 16-byte, and 64-byte, say.

Bloom filters and randomly discarding data
------------------------------------------

The hoped-for result here is signature vectors of a very high dimensionality, such that the expected correlation between two unrelated vectors will be very small.  To increase dimensionality with a fixed memory footprint, you use fewer bits per hash bucket.  The extreme of this approach is to use a bitvector rather than a vector of integers, rather like a Bloom filter --- although I think the probability of false negatives isn't relevant here, so there's probably no need to use more than one hash function.
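A minimal sketch of that bitvector extreme, with names invented here, crc32 again standing in for the hash, and only one hash function, per the reasoning above; overlap of set bits is one plausible similarity measure:

```python
import zlib

def bit_signature(data, n=8, nbits=1024):
    """Bloom-filter-style signature: set one bit per hashed n-gram."""
    bits = 0
    for i in range(len(data) - n + 1):
        bits |= 1 << (zlib.crc32(data[i:i + n]) % nbits)
    return bits

def overlap(a, b):
    """Fraction of a's set bits that are also set in b."""
    set_in_a = bin(a).count("1")
    return bin(a & b).count("1") / set_in_a if set_in_a else 0.0
```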
(Sometimes machine learning people talk about the "curse of dimensionality": if your dataset has a lot of dimensions, it can be hard to predict anything from it because, essentially, the points in the dataset all tend to be on or near the convex hull.  I don't think the curse of dimensionality applies here; really, the more dimensions we have here, the better, subject to the limitations of computing power.)

A difficulty with this approach is that your bitvector needs to be large if you don't want it to end up all ones.  One solution to this is the both-positive-and-negative-hashes thing mentioned previously, but that still leaves the problem that the bitvector you have at the end of the file really reflects only the last few dozen or hundred bytes of the file.

A different, and perhaps better, approach is to use a very large bitvector while you're hashing the file, but then discard most of the bitvector when you finish.  If the hash function is properly random, the first 64 bits of your bitvector will still be expected to be pretty uncorrelated with the first 64 bits of some other file's bitvector, but pretty correlated with the first 64 bits of a bitvector from another version of the same file.  (I think you can calculate that correlation with two instructions on modern amd64 CPUs.)

Ideally, you'd like those bits to be about half 1s and half 0s, so that the expected agreement with an uncorrelated vector is 32 of those 64 bits.  If all your files were the same size, you could achieve this by choosing the right size for your large bitvector.  But what if the files aren't all the same size, or what if some of them have a lot more repetition than others?  Some of them will have hundreds of times more hash values than others.

Well, suppose your method of cutting your raw hash value down to your hash table size is simply to shift off the N low-order bits of the hash.  This essentially merges adjacent hash buckets --- by twos, fours, eights, however many are needed.
Well, in that case, you can do that after the fact, too!  It's slightly easier if you shift "off" the N low-order bits that don't index into a word, by ORing adjacent words together until you have enough bits set for your desired signature density.  In this approach, you don't even need to maintain the whole huge bitvector physically in memory; except in occasional pathological cases, such as empty files or huge blocks of zero bytes, you can be assured of finding 32 separate bits to set in your 64-bit signature word within only the first few hundred signature words.  So the signature creation algorithm looks something like this:

    bits_per_word = 64
    virtual_hash_words = 65536
    physical_hash_words = 256
    bitvectors = [0 for i = 0 to physical_hash_words]

    for i = 0 to file.length - n:
        bit_index = hash(file.contents[i : i+n]) % (virtual_hash_words * bits_per_word)
        (bitshift, word_index) = (bit_index % bits_per_word,
                                  bit_index / bits_per_word)
        if word_index < physical_hash_words:
            bitvectors[word_index] |= 1 << bitshift

    signature = 0
    i = 0
    while popcount(signature) < bits_per_word / 2 and i < physical_hash_words:
        signature |= bitvectors[i++]

The numbers above should suffice for a relatively wide range of file sizes: from files with only about 40 distinct hash values falling into the 256 physical words, which I think means files of about 16 kilobytes and up, to files with about 40 distinct hash values falling into the first word alone, which I think means files about 256 times as large, or about four megabytes.  If you hash three distinct lengths of strings, divide those file sizes by three.

That's still not a very large range of file sizes, though.  There's a hack that vastly expands it.
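Here is the same algorithm as runnable Python, with crc32 as the stand-in hash and the constants from above; `matching_bits` is my spelled-out-the-slow-way version of the two-instruction comparison (XOR then popcount) mentioned earlier:

```python
import zlib

BITS_PER_WORD = 64
VIRTUAL_HASH_WORDS = 65536
PHYSICAL_HASH_WORDS = 256

def file_signature(data, n=16):
    """Set bits in the physical prefix of a mostly-virtual bitvector,
    then OR together leading words until the signature is ~half ones."""
    bitvectors = [0] * PHYSICAL_HASH_WORDS
    for i in range(len(data) - n + 1):
        bit_index = zlib.crc32(data[i:i + n]) % (VIRTUAL_HASH_WORDS * BITS_PER_WORD)
        word_index, bitshift = divmod(bit_index, BITS_PER_WORD)
        if word_index < PHYSICAL_HASH_WORDS:
            bitvectors[word_index] |= 1 << bitshift
    signature, i = 0, 0
    while bin(signature).count("1") < BITS_PER_WORD // 2 and i < PHYSICAL_HASH_WORDS:
        signature |= bitvectors[i]
        i += 1
    return signature

def matching_bits(a, b):
    """Bit positions on which two 64-bit signatures agree; expect ~32
    of 64 for unrelated files, substantially more for related ones."""
    return BITS_PER_WORD - bin(a ^ b).count("1")
```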
Rather than this:

    if word_index < physical_hash_words:
        bitvectors[word_index] |= 1 << bitshift

we hack the word_index logarithmically:

    bitvectors[logarithmic_hack(word_index)] |= 1 << bitshift

where `logarithmic_hack` maps 1 => 0, 2..3 => 1, 4..7 => 2, 8..15 => 3, and so on, with 0 => 0 as a special case; that is, it's just `floor(log(word_index)/log(2))`, an efficient algorithm for which can almost certainly be found in Hacker's Delight.  This would allow the algorithm to scale all the way from empty files and files of one byte up to exabyte-sized files, while using only 4K or so of storage to calculate the signature.

Long files will not appear to be similar to short files with this approach, unless the long files happen to consist almost entirely of bits of the short file repeated over and over again.  I think that's okay.

This approach resembles Pentti Kanerva's associative storage scheme that he calls "fully distributed representation", which as I recall is a kind of "spatter code".  He suggested that the task of finding which known codeword from a large collection has the closest Hamming distance to a search-key codeword could perhaps be handled more efficiently by a neural network than by sequential search.  I don't know enough about neural networks to comment on whether he's right.

Preferring more unusual strings
-------------------------------

Unfortunately, most of the data going into any of the algorithms I've described above is the hash values of very common strings, strings with poor discriminatory power.  In the bitvector/data-discarding approach, this means you're actually discarding the data you need, and it doesn't even make it into the signature; in the dot-product approach, it may merely be lost in the noise being added to all the components of the vector.
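Taking the floor-log2 reading of `logarithmic_hack` (and arbitrarily sending word_index 0 to bucket 0, since log2(0) is undefined), Python's `int.bit_length` can play the role of the Hacker's Delight routine:

```python
def logarithmic_hack(word_index):
    """floor(log2(word_index)) for word_index >= 1; word_index 0 is a
    special case mapped to bucket 0."""
    if word_index == 0:
        return 0
    return word_index.bit_length() - 1
```

With 64-bit words, even the virtual bitvector for an exabyte-sized file collapses this way into only a few dozen physical words.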
If you had a broad signature of "ordinary files", say a Bloom filter built from tens of gigabytes of common strings, you could avoid putting those common strings into your signatures and thus boost their resolving power --- a sort of IDF weighting.  This implies, however, a kind of model of the process generating your files, and if that model is inaccurate, it could make this approach do worse than the baseline.

Preferring more common strings in the Bloom filter
--------------------------------------------------

The other half of TF/IDF is TF: valuing more common terms more.  If you used a three-bit counting Bloom filter, for example, you could leave out of your signature any hash value that occurred fewer than 7 times, thus emphasizing the strings that occurred many times in the file instead of those whose hash values happened to be small.

Unfortunately, this multiplies the memory requirements of the hashing stage: if you want to be able to count up to N in any hash bucket, you need lg N bits per hash bucket.  However, this isn't necessarily true if you're willing to accept errors and count approximately.  In four bits, for example, you can construct a counter that goes to 256, approximately: simply randomly fail to carry half the time.  The least significant bit will toggle, on average, once for every two times you try to increment the counter, which means it will transition from high to low once every four increments; the next-least-significant bit will toggle, on average, once for every two times the LSB transitions high-to-low, which means once every eight increments; and so on.

(It may be a good idea to have less-significant bits toggle with lower probability than more-significant bits, since that will give you less error for the same expected counting range in the same number of bits.
The optimum may in fact be to toggle the LSB with low probability while incrementing the rest of the bits in the usual way, which is equivalent to simply doing an exact count of a random sample of the input.)

There's no theoretical limit to how high you can count in an arbitrarily small number of bits with this method, but of course more bits will produce more precise results.  This approach would allow you to compute (an approximate upper bound on) the corpus frequency of a given word using a compact, fixed-size data structure, and similarly for the frequency within a given document.

Tournaments to find more common strings
---------------------------------------

(I'm pretty sure I posted some nonworking code for this to kragen-hacks once too.)

Suppose you want to find the most common strings (I'll say "words" hereafter in this section) in a long file.  The simplest way is to make a file of all the words, sort it, and count the duplicates, but this requires many passes over the file and potentially a lot more space than the original input.

Instead, you could stream through the file, maintaining a bounded-size heap of the approximately most common words you've seen so far, ordered by approximate frequency.  The most common word would be at the top of the heap, the less common words further down, and the least common words at the bottom.  Until the heap hits its size bound, all the counts in the heap are exact.  Once it fills up, though, inserting a new word at the bottom of the heap requires kicking an old word out.  There are a lot of possible eviction strategies, and I don't know which is optimal; I have the feeling that a word we've seen N occurrences of so far should have an eviction probability proportional to 1/(N+1), and that the new word should itself be considered as a possible eviction candidate (in which case it may not get inserted at all).
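Backing up to the approximate counter of the previous section, here is one way to sketch it (the function name and the rng parameter are mine); note that with a carry that always succeeds, it degenerates into an exact binary increment, which makes a handy sanity check:

```python
import random

def approx_increment(counter, bits=4, rng=random):
    """Randomized increment: each carry (including the initial toggle
    of the LSB) is dropped with probability 1/2, so bit i toggles
    roughly once per 2 * 4**i increments, and b bits count to ~4**b."""
    bit = 0
    while bit < bits:
        if rng.random() >= 0.5:
            return counter              # drop the carry half the time
        counter ^= 1 << bit             # toggle this bit
        if counter & (1 << bit):
            return counter              # went low -> high: done
        bit += 1                        # went high -> low: try to carry
    return counter                      # carry out of the top bit is lost
```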
The most frequent words, then, are very unlikely to get evicted, and if they do get evicted, they're very likely to get reinserted.  Some less frequent words might get evicted and reinserted several times, so their final counts will be low, but still higher than those of hapax legomena.  If you consider words that are not at the bottom of the heap as possible eviction candidates, then there's really no need to maintain a heap at all.

So this is a heuristic approach to finding the most frequent words in a file, one which is likely to find most of the most common words.

## Efficiently choosing from a dynamically varying discrete probability distribution ##

Efficiently choosing a word to evict according to some arbitrary and dynamically changing probability distribution can be done with a kind of binary search tree, one I think is called an interval tree.

First, an array variant of the algorithm.  Put the words at any given time in some arbitrary order: w[0], w[1], ... w[n].  Now the probability density function w[i].p is simply the desired probability of choosing that word, a number between 0 and 1, and from that we compute the CDF or cumulative distribution function w[i].c as the sum of w[j].p for all j <= i; equivalently, w[0].c = w[0].p, and otherwise w[i].c = w[i-1].c + w[i].p.  w[n].c ought to be 1.  Given this, we can pick a random number r from 0 to 1 and binary-search the array for the smallest i such that w[i].c >= r; this chooses a word with the desired probability distribution in O(lg N) time.  The trouble with this is that updating w[i].p requires changing w[j].c for all j >= i, which is O(N).

In the binary tree variant, each tree node tracks the total probability density of its left subtree.
To establish this invariant originally, you could use this O(N) algorithm:

    def put_cdfs(tree):
        if tree is None:
            return 0
        tree.left_total = put_cdfs(tree.left)
        return tree.left_total + tree.p + put_cdfs(tree.right)

Then, to randomly choose a word, in O(lg N) time just like with the array:

    def choose(tree, r):
        if r <= tree.left_total:
            return choose(tree.left, r)
        if r <= tree.left_total + tree.p:
            return tree
        return choose(tree.right, r - tree.left_total - tree.p)

Incrementing the probability of any node in the tree merely requires incrementing the left_total of some or all of its ancestors (omitting those whose parenthood is via a right child), which is O(lg N) if the tree is balanced.  In essence, it's a standard binary search tree, but instead of the search key being the contents of the word, it's the maximum value of the CDF for that word --- plus a funny representation for the key.  This allows you to add words in O(lg N) time, update probabilities in O(lg N) time, and choose words in O(lg N) time.  With some tree rebalancing, it also allows you to remove words in O(lg N) time.  (You don't need tree rebalancing to maintain O(lg N) insertion because you can insert wherever you like.)  You'd presumably use a hash table, ideally with separate chaining, to find the tree node for a particular word in order to update it, costing three more pointers or so.

Signed hashes
-------------

A couple of the cases above involve maintaining data structures full of strings, e.g. hash tables.  If the strings are large, it may make more sense to store only a hash of each string rather than the whole string.  If you are storing your strings in a hash table with only about one item per bucket, and in each item you keep, say, a second, uncorrelated 32-bit hash of the string rather than a 32-bit pointer to the string itself, then you'll get a false positive only about one in every four billion probes.
For cases where a few errors aren't fatal, such as picking common substrings to ignore when hashing, this is likely to be a good tradeoff.  It can also eliminate dynamic allocation entirely from the program.

-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol