Herewith a passel of algorithmic musings, all untested and all uninformed by previous work: some approaches to making fuzzy file signatures that let you quickly find similar files, specifically involving hashing, vector math, substrings of various sizes, bitvectors, Bloom filters, emphasizing more unusual strings, and discovering which strings are unusual in a corpus.
If one of these approaches works, it could be useful for sorting blobs in git packs to improve compression, for inferring cladistic trees from genome data, for filtering spam, for detecting viruses, for discovering copyright violation and plagiarism, for network intrusion detection, for forensic examination of disks, and for maintaining a link to, say, a paragraph within a document that is being edited (as in Queer Numbers).

The previous work I know about in this area is spamsum, CTPH, FTK, and ssdeep.  They use a different approach.  I don't know if this approach is likely to be better or worse.  It certainly ought to be able to find much looser similarities between files (e.g. emails written by different people in the same language).

Fuzzy file signatures for fuzzy "hashing"
-----------------------------------------

11 years ago, I tried this failed experiment:
<http://lists.canonical.org/pipermail/kragen-hacks/1999-February/000157.html>

In quick summary, the idea was to compute a compact "signature" of a file that could be compared against another file's "signature" to see whether the files were not only identical, but similar, by hashing many substrings of each file and comparing the counts in the hash tables.  In pseudocode:

    for each file f:
        hashtable[f] = [0 for i = 0 to tablesize]
        for i = 0 to f.length - n:
            bucket = hash(f.contents[i : i+n]) % tablesize
            hashtable[f][bucket]++

    magnitude = { f: sqrt(sum(val**2 for val in hashtable[f]))
                  for each file f }

    for each file f:
        for each file g:
            normalization = magnitude[f] * magnitude[g]
            dotproduct = sum(hashtable[f][i] * hashtable[g][i]
                             for i = 0 to tablesize)
            similarity[f, g] = dotproduct / normalization

It didn't work because big files with small hash tables tended to push all the hash buckets toward being more or less equal.

I was lying awake at night listening to the wind, and some ideas for improving the approach came to me.
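To make the pseudocode concrete, here is a runnable (but equally untested-in-anger) Python sketch of the same scheme; `ngram_signature`, `n`, and `tablesize` are names I'm inventing here, and zlib's crc32 merely stands in for whatever hash you prefer:

```python
import math
import zlib

def ngram_signature(data, n=4, tablesize=64):
    """Count hashed n-byte substrings of data into a fixed-size table."""
    table = [0] * tablesize
    for i in range(len(data) - n + 1):
        bucket = zlib.crc32(data[i:i + n]) % tablesize
        table[bucket] += 1
    return table

def cosine_similarity(a, b):
    """Dot product of two signature vectors, normalized by magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0
```

Two revisions of the same text should score well above two unrelated files --- and that's precisely the property that decays as files grow much larger than the table.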
Both positive and negative hashes
---------------------------------

The basic problem of bias toward big files can be solved by decrementing the buckets for some hash strings and incrementing them for others.  Instead of

    bucket = hash(f.contents[i : i+n]) % tablesize
    hashtable[f][bucket]++

we have something like

    bucket = hash(f.contents[i : i+n]) % (2 * tablesize)
    hashtable[f][bucket % tablesize] += (1 if bucket < tablesize else -1)

This should make the directions of the file vectors distribute nicely all around the hypersphere, which should make the normalized-dot-product approach less ineffective.

Various sizes of substrings
---------------------------

If two files are different points in time of the same text file, they'll probably have lots of 128-byte chunks in common, not just 16-byte chunks.  On the other hand, if two files are merely written in the same natural language (and character encoding), they'll have hardly any 128-byte chunks in common, but probably lots of 8-byte chunks.  But looking only at 8-byte chunks, particularly hashed, is insufficient for identifying two files as different versions of the same file --- at that granularity, they'll hardly look any more similar than any other two random files from the same language.  So a *general-purpose* file similarity checker might use several different sizes of substring: 4-byte, 16-byte, and 64-byte, say.

Bloom filters and randomly discarding data
------------------------------------------

The hoped-for result here is signature vectors of a very high dimensionality, such that the expected correlation between two unrelated vectors will be very small.  To increase dimensionality with a fixed memory footprint, you use fewer bits per hash bucket.  The extreme of this approach is to use a bitvector rather than a vector of integers, rather like a Bloom filter --- although I think the probability of false negatives isn't relevant here, so there's probably no need to use more than one hash function.
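A minimal sketch of that bitvector extreme, with names invented here, crc32 again standing in for the hash, and only one hash function, per the reasoning above; overlap of set bits is one plausible similarity measure:

```python
import zlib

def bit_signature(data, n=8, nbits=1024):
    """Bloom-filter-style signature: set one bit per hashed n-gram."""
    bits = 0
    for i in range(len(data) - n + 1):
        bits |= 1 << (zlib.crc32(data[i:i + n]) % nbits)
    return bits

def overlap(a, b):
    """Fraction of a's set bits that are also set in b."""
    set_in_a = bin(a).count("1")
    return bin(a & b).count("1") / set_in_a if set_in_a else 0.0
```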
(Sometimes machine learning people talk about the "curse of dimensionality": if your dataset has a lot of dimensions, it can be hard to predict anything from it because, essentially, the points in the dataset all tend to be on or near the convex hull.  I don't think the curse of dimensionality applies here; really, the more dimensions we have here, the better, subject to the limitations of computing power.)

A difficulty with this approach is that your bitvector needs to be large if you don't want it to end up all ones.  One solution to this is the both-positive-and-negative-hashes thing mentioned previously, but that still leaves the problem that the bitvector you have at the end of the file really reflects only the last few dozen or hundred bytes of the file.

A different, and perhaps better, approach is to use a very large bitvector while you're hashing the file, but then discard most of the bitvector when you finish.  If the hash function is properly random, the first 64 bits of your bitvector will still be expected to be pretty uncorrelated with the first 64 bits of some other file's bitvector, but pretty correlated with the first 64 bits of a bitvector from another version of the same file.  (I think you can calculate that correlation with two instructions on modern amd64 CPUs.)

Ideally, you'd like those bits to be about half 1s and half 0s, so that the expected agreement with an uncorrelated vector is 32 of those 64 bits.  If all your files were the same size, you could achieve this by choosing the right size for your large bitvector.  But what if the files aren't all the same size, or what if some of them have a lot more repetition than others?  Some of them will have hundreds of times more hash values than others.

Well, suppose your method of cutting your raw hash value down to your hash table size is simply to shift off the N low-order bits of the hash.  This essentially merges adjacent hash buckets --- by twos, fours, eights, however many are needed.
Well, in that case, you can do that after the fact, too!  It's slightly easier if you shift "off" the N low-order bits that don't index into a word, by ORing adjacent words together until you have enough bits set for your desired signature density.  In this approach, you don't even need to maintain the whole huge bitvector physically in memory; except in occasional pathological cases, such as empty files or huge blocks of zero bytes, you can be assured of finding 32 separate bits to set in your 64-bit signature word within only the first few hundred signature words.  So the signature creation algorithm looks something like this:

    bits_per_word = 64
    virtual_hash_words = 65536
    physical_hash_words = 256
    bitvectors = [0 for i = 0 to physical_hash_words]

    for i = 0 to file.length - n:
        bit_index = hash(file.contents[i : i+n]) % (virtual_hash_words * bits_per_word)
        (bitshift, word_index) = (bit_index % bits_per_word,
                                  bit_index / bits_per_word)
        if word_index < physical_hash_words:
            bitvectors[word_index] |= 1 << bitshift

    signature = 0
    i = 0
    while popcount(signature) < bits_per_word / 2 and i < physical_hash_words:
        signature |= bitvectors[i++]

The numbers above should suffice for a relatively wide range of file sizes: from files with only about 40 distinct hash values falling into the 256 physical words, which I think means files of about 16 kilobytes and up, to files with about 40 distinct hash values falling into the first word alone, which I think means files about 256 times as large, or about four megabytes.  If you hash three distinct lengths of strings, divide those file sizes by three.

That's still not a very large range of file sizes, though.  There's a hack that vastly expands it.
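Here is the same algorithm as runnable Python, with crc32 as the stand-in hash and the constants from above; `matching_bits` is my spelled-out-the-slow-way version of the two-instruction comparison (XOR then popcount) mentioned earlier:

```python
import zlib

BITS_PER_WORD = 64
VIRTUAL_HASH_WORDS = 65536
PHYSICAL_HASH_WORDS = 256

def file_signature(data, n=16):
    """Set bits in the physical prefix of a mostly-virtual bitvector,
    then OR together leading words until the signature is ~half ones."""
    bitvectors = [0] * PHYSICAL_HASH_WORDS
    for i in range(len(data) - n + 1):
        bit_index = zlib.crc32(data[i:i + n]) % (VIRTUAL_HASH_WORDS * BITS_PER_WORD)
        word_index, bitshift = divmod(bit_index, BITS_PER_WORD)
        if word_index < PHYSICAL_HASH_WORDS:
            bitvectors[word_index] |= 1 << bitshift
    signature, i = 0, 0
    while bin(signature).count("1") < BITS_PER_WORD // 2 and i < PHYSICAL_HASH_WORDS:
        signature |= bitvectors[i]
        i += 1
    return signature

def matching_bits(a, b):
    """Bit positions on which two 64-bit signatures agree; expect ~32
    of 64 for unrelated files, substantially more for related ones."""
    return BITS_PER_WORD - bin(a ^ b).count("1")
```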
Rather than this:

    if word_index < physical_hash_words:
        bitvectors[word_index] |= 1 << bitshift

we hack the word_index logarithmically:

    bitvectors[logarithmic_hack(word_index)] |= 1 << bitshift

where `logarithmic_hack` maps 1 => 0, 2..3 => 1, 4..7 => 2, 8..15 => 3, and so on, with 0 => 0 as a special case; that is, it's just `floor(log(word_index)/log(2))`, an efficient algorithm for which can almost certainly be found in Hacker's Delight.  This would allow the algorithm to scale all the way from empty files and files of one byte up to exabyte-sized files, while using only 4K or so of storage to calculate the signature.

Long files will not appear to be similar to short files with this approach, unless the long files happen to consist almost entirely of bits of the short file repeated over and over again.  I think that's okay.

This approach resembles Pentti Kanerva's associative storage scheme that he calls "fully distributed representation", which as I recall is a kind of "spatter code".  He suggested that the task of finding which known codeword from a large collection has the closest Hamming distance to a search-key codeword could perhaps be handled more efficiently by a neural network than by sequential search.  I don't know enough about neural networks to comment on whether he's right.

Preferring more unusual strings
-------------------------------

Unfortunately, most of the data going into any of the algorithms I've described above is the hash values of very common strings, strings with poor discriminatory power.  In the bitvector/data-discarding approach, this means you're actually discarding the data you need, and it doesn't even make it into the signature; in the dot-product approach, it may merely be lost in the noise being added to all the components of the vector.
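Taking the floor-log2 reading of `logarithmic_hack` (and arbitrarily sending word_index 0 to bucket 0, since log2(0) is undefined), Python's `int.bit_length` can play the role of the Hacker's Delight routine:

```python
def logarithmic_hack(word_index):
    """floor(log2(word_index)) for word_index >= 1; word_index 0 is a
    special case mapped to bucket 0."""
    if word_index == 0:
        return 0
    return word_index.bit_length() - 1
```

With 64-bit words, even the virtual bitvector for an exabyte-sized file collapses this way into only a few dozen physical words.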
If you had a broad signature of "ordinary files", say a Bloom filter built from tens of gigabytes of common strings, you could avoid putting those common strings into your signatures and thus boost their resolving power --- a sort of IDF weighting.  This implies, however, a kind of model of the process generating your files, and if that model is inaccurate, it could make this approach do worse than the baseline.

Preferring more common strings in the Bloom filter
--------------------------------------------------

The other half of TF/IDF is TF: valuing more common terms more.  If you used a three-bit counting Bloom filter, for example, you could leave out of your signature any hash value that occurred fewer than 7 times, thus emphasizing the strings that occurred many times in the file instead of those whose hash values happened to be small.

Unfortunately, this multiplies the memory requirements of the hashing stage: if you want to be able to count up to N in any hash bucket, you need lg N bits per hash bucket.  However, this isn't necessarily true if you're willing to accept errors and count approximately.  In four bits, for example, you can construct a counter that goes to 256, approximately: simply randomly fail to carry half the time.  The least significant bit will toggle, on average, once for every two times you try to increment the counter, which means it will transition from high to low once every four increments; the next-least-significant bit will toggle, on average, once for every two times the LSB transitions high-to-low, which means once every eight increments; and so on.

(It may be a good idea to have less-significant bits toggle with lower probability than more-significant bits, since that will give you less error for the same expected counting range in the same number of bits.
The optimum may in fact be to toggle the LSB with low probability while incrementing the rest of the bits in the usual way, which is equivalent to simply doing an exact count of a random sample of the input.)

There's no theoretical limit to how high you can count in an arbitrarily small number of bits with this method, but of course more bits will produce more precise results.  This approach would allow you to compute (an approximate upper bound on) the corpus frequency of a given word using a compact, fixed-size data structure, and similarly for the frequency within a given document.

Tournaments to find more common strings
---------------------------------------

(I'm pretty sure I posted some nonworking code for this to kragen-hacks once too.)

Suppose you want to find the most common strings (I'll say "words" hereafter in this section) in a long file.  The simplest way is to make a file of all the words, sort it, and count the duplicates, but this requires many passes over the file and potentially a lot more space than the original input.

Instead, you could stream through the file, maintaining a bounded-size heap of the approximately most common words you've seen so far, ordered by approximate frequency.  The most common word would be at the top of the heap, the less common words further down, and the least common words at the bottom.  Until the heap hits its size bound, all the counts in the heap are exact.  Once it fills up, though, inserting a new word at the bottom of the heap requires kicking an old word out.  There are a lot of possible eviction strategies, and I don't know which is optimal; I have the feeling that a word we've seen N occurrences of so far should have an eviction probability proportional to 1/(N+1), and that the new word should itself be considered as a possible eviction candidate (in which case it may not get inserted at all).
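Backing up to the approximate counter of the previous section, here is one way to sketch it (the function name and the rng parameter are mine); note that with a carry that always succeeds, it degenerates into an exact binary increment, which makes a handy sanity check:

```python
import random

def approx_increment(counter, bits=4, rng=random):
    """Randomized increment: each carry (including the initial toggle
    of the LSB) is dropped with probability 1/2, so bit i toggles
    roughly once per 2 * 4**i increments, and b bits count to ~4**b."""
    bit = 0
    while bit < bits:
        if rng.random() >= 0.5:
            return counter              # drop the carry half the time
        counter ^= 1 << bit             # toggle this bit
        if counter & (1 << bit):
            return counter              # went low -> high: done
        bit += 1                        # went high -> low: try to carry
    return counter                      # carry out of the top bit is lost
```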
The most frequent words, then, are very unlikely to get evicted, and if they do get evicted, they're very likely to get reinserted.  Some less frequent words might get evicted and reinserted several times, so their final counts will be low, but still higher than those of hapax legomena.  If you consider words that are not at the bottom of the heap as possible eviction candidates, then there's really no need to maintain a heap at all.

So this is a heuristic approach to finding the most frequent words in a file, one which is likely to find most of the most common words.

## Efficiently choosing from a dynamically varying discrete probability distribution ##

Efficiently choosing a word to evict according to some arbitrary and dynamically changing probability distribution can be done with a kind of binary search tree, one I think is called an interval tree.

First, an array variant of the algorithm.  Put the words at any given time in some arbitrary order: w[0], w[1], ... w[n].  Now the probability density function w[i].p is simply the desired probability of choosing that word, a number between 0 and 1, and from that we compute the CDF or cumulative distribution function w[i].c as the sum of w[j].p for all j <= i; equivalently, w[0].c = w[0].p, and otherwise w[i].c = w[i-1].c + w[i].p.  w[n].c ought to be 1.  Given this, we can pick a random number r from 0 to 1 and binary-search the array for the smallest i such that w[i].c >= r; this chooses a word with the desired probability distribution in O(lg N) time.  The trouble with this is that updating w[i].p requires changing w[j].c for all j >= i, which is O(N).

In the binary tree variant, each tree node tracks the total probability density of its left subtree.
To establish this invariant originally, you could use this O(N) algorithm:

    def put_cdfs(tree):
        if tree is None:
            return 0
        tree.left_total = put_cdfs(tree.left)
        return tree.left_total + tree.p + put_cdfs(tree.right)

Then, to randomly choose a word, in O(lg N) time just like with the array:

    def choose(tree, r):
        if r <= tree.left_total:
            return choose(tree.left, r)
        if r <= tree.left_total + tree.p:
            return tree
        return choose(tree.right, r - tree.left_total - tree.p)

Incrementing the probability of any node in the tree merely requires incrementing the left_total of some or all of its ancestors (omitting those whose parenthood is via a right child), which is O(lg N) if the tree is balanced.  In essence, it's a standard binary search tree, but instead of the search key being the contents of the word, it's the maximum value of the CDF for that word --- plus a funny representation for the key.  This allows you to add words in O(lg N) time, update probabilities in O(lg N) time, and choose words in O(lg N) time.  With some tree rebalancing, it also allows you to remove words in O(lg N) time.  (You don't need tree rebalancing to maintain O(lg N) insertion because you can insert wherever you like.)  You'd presumably use a hash table, ideally with separate chaining, to find the tree node for a particular word in order to update it, costing three more pointers or so.

Signed hashes
-------------

A couple of the cases above involve maintaining data structures full of strings, e.g. hash tables.  If the strings are large, it may make more sense to store only a hash of each string rather than the whole string.  If you are storing your strings in a hash table with only about one item per bucket, and in each item you keep, say, a second, uncorrelated 32-bit hash of the string rather than a 32-bit pointer to the string itself, then you'll get a false positive only about one in every four billion probes.
For cases where a few errors aren't fatal, such as picking common substrings to ignore when hashing, this is likely to be a good tradeoff.  It can also eliminate dynamic allocation entirely from the program.

-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol