Paul Houselander wrote:
> I'm getting some spam slip through with subjects like
>
> vi'aqra pr,ofe'ssio,nal matters very much to your s.e,x
> be self-satisfied - use vi'aqra s<u>per act,i've
> vi'aqra pr<o>fessional - never forget about your s'e.x
This is something I'd typically just throw to the bayes database, but
it's a particularly good example of the spammers' growing ability to
mess with bayesian tokens. IIRC, the bayes token generator will parse
that into tokens like {vi, aqra, pr, ofe, ssio, nal, matters, ...},
and this may be becoming a problem. Perhaps it's worthwhile to revisit
this mechanism.

Strip the whole document of most (all?) non-alphanumeric/non-whitespace
characters, collapse (and copy) s p a c e d words, /then/ feed into
bayes:

  s/[\/_]/ /g;             # arguably include dash here
  s/[^\s\w]//g;
  my $raw    = $_;
  my $spaced = '';
  while ($raw =~ /(?:\s+\w){2,}\s+/) {
      my $word = $&;
      $raw =~ s// /;       # empty pattern reuses the match above
      $word =~ s/\s//g;
      $word = 'spaced:' . $word;   # optional, notes the word was spaced
      $spaced .= ' ' . $word;
  }
  $_ .= $spaced;           # or whenever

So this:

> vi'aqra pr,ofe'ssio,nal matters very much to your s.e,x
> be self-satisfied - use vi'aqra s<u>per act,i've
> vi'aqra pr<o>fessional - never forget about your s'e.x
> test s p a c e d words t w i c e in a line
> this is an act--i've shown it 5 x, a record!

Becomes this:

> viaqra professional matters very much to your sex
> be selfsatisfied use viaqra super active
> viaqra professional never forget about your sex
> s p a c e d words t w i c e in a line spaced:spaced spaced:twice
> this is an active shown it 5 x a record spaced:5xa

The last line demonstrates two issues this creates: words can get
mashed together a little too aggressively (here it even forms a real
word!), and the "spaced:5xa" token is mostly noise.

... of course, my girlfriend (a bayesian ~expert) would probably
chastise me and remind me that Bayes is supposed to be smart enough to
figure this out on its own, but she'd also yell at us for running
naive bayes (affectionately called "idiot bayes") on a simple
bag-of-words model rather than using priors and examining sentence
structure somehow (e.g. CRM114). With sentence structure, my
suggestions become largely useless (and the minimum token counts rise
significantly).

(Is this better served as a bug on issues.apache.org?)
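P.S. For anyone who wants to poke at this locally, here's the snippet
above wrapped into a self-contained script (the sub name and the
'spaced:' prefix are just my strawman, not anything in the SA tree):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Strawman normalizer: strip punctuation, collapse s p a c e d
  # words, and append 'spaced:' marker tokens to the text.
  sub normalize_for_bayes {
      my $text = shift;
      $text =~ s/[\/_]/ /g;        # arguably include dash here
      $text =~ s/[^\s\w]//g;       # drop remaining punctuation
      my $raw    = $text;
      my $spaced = '';
      while ($raw =~ /(?:\s+\w){2,}\s+/) {
          my $word = $&;
          $raw =~ s// /;           # empty pattern reuses the match above
          $word =~ s/\s//g;        # "t w i c e" -> "twice"
          $spaced .= ' spaced:' . $word;
      }
      return $text . $spaced;
  }

  for my $line (
      "vi'aqra pr,ofe'ssio,nal matters very much to your s.e,x",
      "test s p a c e d words t w i c e in a line",
      "this is an act--i've shown it 5 x, a record!",
  ) {
      print normalize_for_bayes($line), "\n";
  }

One caveat I noticed while testing: the trailing \s+ in the pattern
means a spaced word at the very end of a line is never matched, so a
real patch would probably want /(?:\s+\w){2,}(?:\s+|$)/ or similar.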