Paul Houselander wrote:
> I'm getting some spam slip through with subjects like
>
> vi'aqra pr,ofe'ssio,nal matters very much to your s.e,x
> be self-satisfied - use vi'aqra s<u>per act,i've
> vi'aqra pr<o>fessional - never forget about your s'e.x
This is something I'd typically just throw to the bayes database, but
it's a particularly good example of the spammers' growing ability to
mess with bayesian tokens. IIRC, the bayes token generator will parse
that into tokens like {vi, aqra, pr, ofe, ssio, nal, matters, ...},
and this may be becoming a problem. Perhaps it's worthwhile to revisit
this mechanism.

Strip the whole document of most (all?) non-alphanumeric/non-whitespace
characters, collapse (and copy) s p a c e d words, /then/ feed into
bayes:

  s/[\/_]/ /g;             # arguably include dash here
  s/[^\s\w]//g;
  my $raw    = $_;
  my $spaced = '';
  while ($raw =~ /(?:\s+\w){2,}\s+/) {
      my $word = $&;
      $raw =~ s// /;       # empty pattern reuses the match above
      $word =~ s/\s//g;
      $word = 'spaced:' . $word;   # optional, notes the word was spaced
      $spaced .= ' ' . $word;
  }
  $_ .= $spaced;           # or whenever

So this:

> vi'aqra pr,ofe'ssio,nal matters very much to your s.e,x
> be self-satisfied - use vi'aqra s<u>per act,i've
> vi'aqra pr<o>fessional - never forget about your s'e.x
> test s p a c e d words t w i c e in a line
> this is an act--i've shown it 5 x, a record!

Becomes this:

> viaqra professional matters very much to your sex
> be selfsatisfied use viaqra super active
> viaqra professional never forget about your sex
> s p a c e d words t w i c e in a line spaced:spaced spaced:twice
> this is an active shown it 5 x a record spaced:5xa

The last line demonstrates two issues this creates: words can get
mashed together a little too aggressively (here it even forms a real
word!), and the "spaced:5xa" token is mostly noise.

... of course, my girlfriend (a bayesian ~expert) would probably
chastise me and remind me that Bayes is supposed to be smart enough to
figure this out on its own, but she'd also yell at us for running
naive bayes (affectionately called "idiot bayes") on a simple
bag-of-words model rather than using priors and examining sentence
structure somehow (e.g. CRM114). With sentence structure, my
suggestions become largely useless (and the minimum token counts rise
significantly).

(Is this better served as a bug on issues.apache.org?)
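P.S. For anyone who wants to poke at this locally, here's the snippet
above wrapped into a self-contained script (the sub name and the
'spaced:' prefix are just my strawman, not anything in the SA tree):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Strawman normalizer: strip punctuation, collapse s p a c e d
  # words, and append 'spaced:' marker tokens to the text.
  sub normalize_for_bayes {
      my $text = shift;
      $text =~ s/[\/_]/ /g;        # arguably include dash here
      $text =~ s/[^\s\w]//g;       # drop remaining punctuation
      my $raw    = $text;
      my $spaced = '';
      while ($raw =~ /(?:\s+\w){2,}\s+/) {
          my $word = $&;
          $raw =~ s// /;           # empty pattern reuses the match above
          $word =~ s/\s//g;        # "t w i c e" -> "twice"
          $spaced .= ' spaced:' . $word;
      }
      return $text . $spaced;
  }

  for my $line (
      "vi'aqra pr,ofe'ssio,nal matters very much to your s.e,x",
      "test s p a c e d words t w i c e in a line",
      "this is an act--i've shown it 5 x, a record!",
  ) {
      print normalize_for_bayes($line), "\n";
  }

One caveat I noticed while testing: the trailing \s+ in the pattern
means a spaced word at the very end of a line is never matched, so a
real patch would probably want /(?:\s+\w){2,}(?:\s+|$)/ or similar.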