On Tue, Oct 08, 2002 at 11:23:59AM +0100, nemesis wrote: > Hello again, > > I have a database (mySQL) full of variable length text fields (average > about 1500 characters, 250 words). Curently there are about 250 fields, > but I hope this to expand to as many as possible (it is an online joke > archive). > > I need to be able to check that when I add another field that there is not > another already in the database that is the same or very similar. Checking > the MD5 sum won't work[0], it will only give an exact match, and some of > the jokes come in with slight variations in the wording but with that are > essentially the same, but the MD5 sum will not tell me this.
OK, I promised I'd answer this and what better time than when drunk... So when solving these kinds of things it's useful to think "who else might be solving this, or trying to solve it?" It turns out it's a fairly common problem, divining similarity in texts. How about... Spam. The perps have already wised up to checksumming messages and subject lines so often add random codes to the bottom of messages and to the end of subject lines to throw off these techniques (and also to track responses). You'll even see gratuitous <!-- HTML commenting --> mid-word. So the task now befalls the anti-spam brigade to detect this: messages similar but not identical. One such system that has addressed this wrt spam is Razor http://razor.sourceforge.net/ using cmeclax's 256bit Nilsimsa http://lexx.shinn.net/cmeclax/nilsimsa.html signatures to identify spams over the wire. Nilsimsa uses n-gram counting -- sequences of tokens (words, letters, spaces) and representing the results (frequencies) as vectors - this allows a whole class of mathematics theory to be used on things that wouldn't normally be amenable to such treatment. (Nilsimsa is, iirc, Lojban verb for "degree of similarity between..". Lojban is an artificial language like Esperanto, but on steroids.) A Nilsimsa signature as currently implemented is a 256 bit sequence, each bit representing an tri-gram bucket. Comparison between two signatures is as simple as comparing bits at each position, the more that match the more similar. The range is -127 to 128 (i.e. $matching_bits - 127). So once you have a database of Nilsimsa signatures you would, at first blush, try to compare a new {joke, spam, message}'s signature against each one, and get a short-list of similarity (Nilsimsa >100 is a pretty solid match). As it happens, the task of doing this in less than O(n) is still open, unfortunately, although some techniques for reducing this by a (big) factor have come up. Those more into the research side may be interested in showing whether each bit is an independent event and whether its probability remains constant (it was originally postulated as binomial but that doesn't look likely now). The relevance of this is being able to apply mathematical knowledge about a distribution, i.e. to say with certainty the certainty of similarity. Anyway, some code. Handily, there's Digest::Nilsimsa, a wrapper on the original nilsimsa source by one of the Razor authors. It only exposes the text -> signature capability, but there's a bunch of other goodies in the C code. This code slurps files on the command line and passes nilsimsa judgement on them, # paulm, 2002-10-10 # cmp.pl file1 file2 [file_n ...] use Digest::Nilsimsa; my $nilsimsa = new Digest::Nilsimsa; local $/; my %code; my $similarity = 100; $code{$ARGV} = pack "H*", $nilsimsa->text2digest($_) while <>; my @files = keys %code; for my $i (0..@files-2) { for my $j ($i+1..@files-1) { my $nilsimsa = nilsimsa(@code{@files[$i,$j]}); print "$files[$i] is like $files[$j] ($nilsimsa)\n" if $nilsimsa > $similarity; } } # nilsimsa is a measure of similarity, from -128 to 127; higher is more similar. # Measure by counting bit correlation of the Nilsimsa hashes. # Convert into a bit-string "011011011..." and use perl's bitwise string # xor, then count zeroes (bit matches of course end up as a zero). sub nilsimsa { ((unpack "b256", $_[0]) ^ (unpack "b256", $_[1])) =~ tr/\x00// - 128; } __END__ The Nilsimsa module uses the original source code but with int main() and other bits hacked out. Watch: $ ls -l main.c _nilsimsa.c -rw-r--r-- 1 paulm paulm 18608 Oct 9 15:26 _nilsimsa.c -rw-r--r-- 1 paulm paulm 21819 Oct 9 15:26 main.c $ ./cmp.pl main.c _nilsimsa.c main.c is like _nilsimsa.c (118) $ Nifty! I split my spam inbox up into message bodies, one per file and indeed is was picking them out quite nicely. Have fun! Cheers, Paul -- Paul Makepeace ....................................... http://paulm.com/ "If you were to ask me, then everyone will want to pass the buck." -- http://paulm.com/toys/surrealism/