On Tue, Oct 08, 2002 at 11:23:59AM +0100, nemesis wrote:
> Hello again,
> 
> I have a database (mySQL) full of variable length text fields (average 
> about 1500 characters, 250 words).  Curently there are about 250 fields, 
> but I hope this to expand to as many as possible (it is an online joke 
> archive).
> 
> I need to be able to check that when I add another field that there is not 
> another already in the database that is the same or very similar.  Checking 
> the MD5 sum won't work[0], it will only give an exact match, and some of 
> the jokes come in with slight variations in the wording but with that are 
> essentially the same, but the MD5 sum will not tell me this.

OK, I promised I'd answer this and what better time than when drunk...

So when solving these kinds of things it's useful to think "who else
might be solving this, or trying to solve it?" It turns out it's a
fairly common problem, divining similarity in texts.

How about... Spam. The perps have already wised up to checksumming
messages and subject lines so often add random codes to the bottom of
messages and to the end of subject lines to throw off these techniques
(and also to track responses). You'll even see gratuitous <!-- HTML
commenting --> mid-word. So the task now befalls the anti-spam brigade
to detect this: messages similar but not identical.

One such system that has addressed this wrt spam is Razor
http://razor.sourceforge.net/ using cmeclax's 256bit Nilsimsa
http://lexx.shinn.net/cmeclax/nilsimsa.html signatures to identify spams
over the wire. Nilsimsa uses n-gram counting -- sequences of tokens
(words, letters, spaces) and representing the results (frequencies) as
vectors - this allows a whole class of mathematics theory to be used on
things that wouldn't normally be amenable to such treatment.

(Nilsimsa is, iirc, Lojban verb for "degree of similarity between..".
Lojban is an artificial language like Esperanto, but on steroids.)

A Nilsimsa signature as currently implemented is a 256 bit sequence,
each bit representing an tri-gram bucket. Comparison between two
signatures is as simple as comparing bits at each position, the more
that match the more similar. The range is -127 to 128
(i.e. $matching_bits - 127).

So once you have a database of Nilsimsa signatures you would, at first
blush, try to compare a new {joke, spam, message}'s signature against
each one, and get a short-list of similarity (Nilsimsa >100 is a
pretty solid match). As it happens, the task of doing this in less than
O(n) is still open, unfortunately, although some techniques for reducing
this by a (big) factor have come up. Those more into the research side
may be interested in showing whether each bit is an independent event
and whether its probability remains constant (it was originally
postulated as binomial but that doesn't look likely now). The relevance
of this is being able to apply mathematical knowledge about a
distribution, i.e. to say with certainty the certainty of similarity.

Anyway, some code. Handily, there's Digest::Nilsimsa, a wrapper on the
original nilsimsa source by one of the Razor authors. It only exposes
the text -> signature capability, but there's a bunch of other goodies
in the C code.

This code slurps files on the command line and passes nilsimsa judgement
on them,

# paulm, 2002-10-10
# cmp.pl file1 file2 [file_n ...]

use Digest::Nilsimsa;
my $nilsimsa = new Digest::Nilsimsa;

local $/;
my %code;
my $similarity = 100;

$code{$ARGV} = pack "H*", $nilsimsa->text2digest($_) while <>;

my @files = keys %code;
for my $i (0..@files-2) {
        for my $j ($i+1..@files-1) {
                my $nilsimsa = nilsimsa(@code{@files[$i,$j]});
                print "$files[$i] is like $files[$j] ($nilsimsa)\n" if $nilsimsa > 
$similarity;
        }
}

# nilsimsa is a measure of similarity, from -128 to 127; higher is more similar.
# Measure by counting bit correlation of the Nilsimsa hashes.
# Convert into a bit-string "011011011..." and use perl's bitwise string
# xor, then count zeroes (bit matches of course end up as a zero).
sub nilsimsa {
        ((unpack "b256", $_[0]) ^ (unpack "b256", $_[1])) =~ tr/\x00// - 128;
}
__END__

The Nilsimsa module uses the original source code but with int main()
and other bits hacked out. Watch:

$ ls -l main.c _nilsimsa.c 
-rw-r--r--    1 paulm    paulm       18608 Oct  9 15:26 _nilsimsa.c
-rw-r--r--    1 paulm    paulm       21819 Oct  9 15:26 main.c
$ ./cmp.pl main.c _nilsimsa.c 
main.c is like _nilsimsa.c (118)
$

Nifty!

I split my spam inbox up into message bodies, one per file and indeed is
was picking them out quite nicely. Have fun!

Cheers,
Paul

-- 
Paul Makepeace ....................................... http://paulm.com/

"If you were to ask me, then everyone will want to pass the buck."
   -- http://paulm.com/toys/surrealism/

Reply via email to