On 27 Nov 2003 01:13:04 -0600, Scott A Crosby <[EMAIL PROTECTED]>
posted to spamassassin-devel and spamassassin-talk:
 > On Wed, 26 Nov 2003 14:17:30 +0600, Alexander Litvinov
 > <[EMAIL PROTECTED]> writes:
 >> > Solution is to learn a monogram, bigram and trigram character model
 >> > for the ham you receive. Mix the statistics together (to account for
 >> > partial information) and that'll be very good at detecting gibberish
 >> > and foreign languages. Assume if it's not been seen before that it's a
 >> > spam sign. Canonicalize the non-alphabetic tokens and it could detect,
 >> > weakly, mangled text like. V.I.A.G.....
 >> This can be the solution, but V.I.A.G is good example of byers work.
 > Not really. Spammers can use:
 > V.I.A.G.R.A
 > V.I.A.G.R A
<...>
 > V.I.A.G_R,A
 > V.I.A.G_R_A
 > and so on. Ignoring those that use " ", which would break the word into
 > two tokens, that is 3^5, or 243 distinct tokens. If they add a random
 > [,_.] at the begin and end, that's 3^7=2187 distinct tokens, or 2^2*4^5
 > = 9216 distinct ways to write viagr, without mangling a single
 > letter.

The solution to this is to "normalize" each message before you pass it
to the rules which examine the n-grams. I believe that's what was
meant by "canonicalize" in the earliest message quoted above -- you'd
replace all punctuation (and maybe whitespace too) with a single
punctuation character ... or even strip out all punctuation and
whitespace entirely and then look at the resulting n-grams.
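
As a rough illustration of both variants, here is a minimal Python
sketch. The helper names (collapse_punctuation, strip_punctuation,
char_ngrams) are made up for this example and are not SpamAssassin
code, and lowercasing the text is an extra assumption of mine. It
builds the 3^5 = 243 separator variants mentioned above and shows
that they all collapse to a single form.

```python
import itertools
import re

# Hypothetical helpers for illustration only; not SpamAssassin code.
# [\W_]+ matches any run of punctuation, underscores and whitespace.
_NON_ALNUM = re.compile(r"[\W_]+")


def collapse_punctuation(text, placeholder="."):
    """Replace each run of punctuation/whitespace with one placeholder."""
    return _NON_ALNUM.sub(placeholder, text.lower())


def strip_punctuation(text):
    """Drop punctuation and whitespace entirely."""
    return _NON_ALNUM.sub("", text.lower())


def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Every way of putting one of [,_.] into each of the 5 gaps of VIAGRA.
letters = "VIAGRA"
variants = {
    "".join(letter + sep for letter, sep in zip(letters, seps + ("",)))
    for seps in itertools.product(",_.", repeat=5)
}

collapsed = {collapse_punctuation(v) for v in variants}
stripped = {strip_punctuation(v) for v in variants}
trigrams = char_ngrams("viagra", 3)
```

Note that stripping whitespace as well will also glue neighbouring
words together, so the n-grams then span word boundaries; whether
that helps or hurts probably depends on the rule.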

More generally, I believe it would make sense to define a handful of
different "normal forms" for different classes of rules.
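
One way this could look, purely as a sketch: compute each normal form
once per message and let every rule say which form it wants to run
against. All names here (NORMAL_FORMS, RULES, matching_rules) are
hypothetical and do not correspond to SpamAssassin's actual rule
syntax.

```python
import re

# Hypothetical names throughout; a sketch, not SpamAssassin's rule API.
NORMAL_FORMS = {
    "raw": lambda text: text,
    # Runs of punctuation/whitespace collapsed to a single "."
    "collapsed": lambda text: re.sub(r"[\W_]+", ".", text),
    # Punctuation and whitespace removed entirely
    "stripped": lambda text: re.sub(r"[\W_]+", "", text),
}


def normalize_all(text):
    """Compute every normal form once so each rule can pick the one it needs."""
    return {name: form(text) for name, form in NORMAL_FORMS.items()}


# Each rule names the normal form it should be matched against.
RULES = [
    ("stripped", re.compile(r"viagra", re.IGNORECASE)),
    ("raw", re.compile(r"\bviagra\b", re.IGNORECASE)),
]


def matching_rules(text):
    """Return (normal form, pattern) for every rule that fires on text."""
    forms = normalize_all(text)
    return [(form, rx.pattern) for form, rx in RULES if rx.search(forms[form])]
```

With these example rules, a mangled "V.I.A.G_R,A" is only caught by
the rule that runs on the stripped form, while the plain word is
caught by both.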

/* era */

-- 
The email address era     the contact information   Just for kicks, imagine
at iki dot fi is heavily  link on my home page at   what it's like to get
spam filtered.  If you    <http://www.iki.fi/era/>  500 pieces of spam for
want to reach me, see     instead.                  each wanted message.


