On 27 Nov 2003 01:13:04 -0600, Scott A Crosby <[EMAIL PROTECTED]>
posted to spamassassin-devel and spamassassin-talk:
 > On Wed, 26 Nov 2003 14:17:30 +0600, Alexander Litvinov
 > <[EMAIL PROTECTED]> writes:
 >> > Solution is to learn a monogram, bigram and trigram character model
 >> > for the ham you recieve. Mix the statistics together (to account for
 >> > partial information) and that'll be very good at detecting gibberish
 >> > and foreign languages. Assume if its not been seen before that its a
 >> > spam sign. Canonicalize the non-alphabetic tokens and it could detect,
 >> > weakly, mangled text like. V.I.A.G.....
 >> This can be the solution, but V.I.A.G is good example of byers work.
 > Not really. Spammers can use:
 > V.I.A.G.R.A
 > V.I.A.G.R A
 > <...>
 > V.I.A.G_R,A
 > V.I.A.G_R_A
 > and so on. Ignoring those that use " " which would break the word into
 > two tokens, that is 3^5, or 243 distinct tokens, If they add a random
 > [,_.] at the begin and end, thats 3^7=2187 distinct tokens, or 2^2*4^5
 > = 9216 distinct ways to write viagr, without mangling a single
 > letter.
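For what it's worth, the first two counts in the quoted message are easy to confirm by brute force. Below is a minimal sketch in Python; the names `variants`, `SEPARATORS` and `LETTERS` are made up for this illustration and do not come from SpamAssassin. The third figure is harder to reproduce: 2^2*4^5 works out to 4096, while 9216 equals 3^2*4^5, so one of the two is presumably a typo.

```python
from itertools import product

SEPARATORS = ",_."   # the three separator characters from the quoted message
LETTERS = "VIAGRA"   # six letters, so five gaps between them


def variants(letters, separators):
    """Yield every spelling with exactly one separator in each gap."""
    gaps = len(letters) - 1
    for seps in product(separators, repeat=gaps):
        parts = [letters[0]]
        for sep, letter in zip(seps, letters[1:]):
            parts.append(sep)
            parts.append(letter)
        yield "".join(parts)


# One of three separators in each of the five gaps: 3^5 spellings.
inner = set(variants(LETTERS, SEPARATORS))

# One more separator at the beginning and one at the end: 3^7 spellings.
wrapped = {a + v + b for v in inner for a in SEPARATORS for b in SEPARATORS}

print(len(inner), len(wrapped))
```

Each of those spellings is a distinct token to a filter that learns on raw tokens, which is the point being made in the quoted text.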
The solution to this is to "normalize" each message before you pass it
to the rules which examine the n-grams. I believe that's what was meant
by "canonicalize" in the earliest message quoted above -- you'd replace
all punctuation (and maybe whitespace too) with a single punctuation
character ... or even strip out all punctuation and whitespace entirely
and then look at the resulting n-grams.

More generally, I believe it would make sense to define a handful of
different "normal forms" for different classes of rules.

/* era */

-- 
The email address era at iki dot fi is heavily spam filtered. If you
want to reach me, see the contact information link on my home page at
<http://www.iki.fi/era/> instead.

Just for kicks, imagine what it's like to get 500 pieces of spam for
each wanted message.
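The two normal forms suggested above (collapse every run of punctuation and whitespace to one character, or strip it out entirely) could be sketched roughly as follows. This is only an illustration under stated assumptions, not how SpamAssassin does or would do it: the function names `collapse`, `strip_all` and `char_ngrams` are hypothetical, only ASCII letters and digits are treated as "real" characters, and the lowercasing step is an extra assumption that the message above does not ask for.

```python
import re

# Assumption: anything that is not an ASCII letter or digit counts as
# "punctuation or whitespace" (this includes the underscore).
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def collapse(text, placeholder="."):
    """Normal form 1: each run of punctuation/whitespace becomes one placeholder."""
    collapsed = _NON_ALNUM.sub(placeholder, text)
    # Drop leading/trailing placeholders; lowercasing is an added assumption.
    return collapsed.strip(placeholder).lower()


def strip_all(text):
    """Normal form 2: remove punctuation and whitespace entirely."""
    return _NON_ALNUM.sub("", text).lower()


def char_ngrams(text, n):
    """Return the list of overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


samples = ["V.I.A.G.R.A", "V.I.A.G.R A", "V.I.A.G_R,A", "_V.I.A.G_R_A."]
for s in samples:
    print(s, "->", collapse(s), "/", strip_all(s))

print(char_ngrams(strip_all("V.I.A.G_R,A"), 3))
```

All four sample spellings end up as the same string under either normal form, so the n-gram rules see one token instead of hundreds. The obvious cost of the second form is that stripping whitespace also joins neighbouring words, which is one reason to keep several normal forms around for different classes of rules.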