"Fred I-IS.COM" <[EMAIL PROTECTED]> writes:
> I created a list which might be helpful. Using a dictionary, I searched for
> letter pairs which did not exist. I created the following meta rule to
> search for these non-existent pairs; it might do just what you are looking
> for.
Your meta rule seems to work pretty well.
Some issues that might need to be worked out:
- getting it to work in an internationalized fashion. We could just
  write a rule that is only used when the message specifies that it is
  English, or when "ok_languages en" is set, or something like that,
  but that is non-optimal
- false positives are still a bit high:
  - PGP signatures
  - some "legitimate" URLs (the Network Solutions unsubscribe URL for
    renewal notices)
Another thing that might work well is to use an eval test that counts
non-existent pairs instead. There are also the triplet and N-gram files
used by the language testing in TextCat.pm. We could test N-gram
frequency and, if the message is well off the language model for its
advertised language, score a hit.
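
To make the eval-test idea concrete, here is a minimal sketch in Python (a real SpamAssassin eval test would be written in Perl). It counts every occurrence of an odd pair rather than just noting whether a pattern matched. The pair list is copied from the __OBFU_* patterns in the rule definitions in this message; the function names and the min_count threshold are hypothetical and untuned.

```python
import re

# Letter pairs taken from the __OBFU_* patterns in this message,
# merged into one alternation so every occurrence can be counted.
ODD_PAIR = re.compile(
    r"j[bcfgw]|vj|vk|xj|xk|yy|zf|zj|[jkpqtvwz]q|q[afhjkmnsy]"
    r"|[fgqw]v|[cgjkqsvz]x|[fjkpqx]z",
    re.IGNORECASE,
)


def count_odd_pairs(text):
    """Return the number of non-overlapping odd-pair occurrences in text."""
    return len(ODD_PAIR.findall(text))


def odd_pair_hit(text, min_count=3):
    """Hypothetical eval-style test: true when at least min_count pairs occur.

    min_count is an assumed, untuned threshold, not a value from this thread.
    """
    return count_odd_pairs(text) >= min_count
```

A count like this could also be divided by the message length, so that long messages are not penalized for a handful of accidental pairs.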
Some quick results:
OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
    9810     4814     4996  0.491  0.00   0.00  (all messages)
 100.000  49.0724  50.9276  0.491  0.00   0.00  (all messages as %)
   5.902  11.8612   0.1601  0.987  0.90   1.00  T_FVGT_M_MULTI_ODD_3
   9.521  19.0278   0.3603  0.981  0.89   1.00  T_FVGT_M_MULTI_ODD_2
  15.821  30.1413   2.0216  0.937  0.80   1.00  T_FVGT_M_MULTI_ODD_1
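
For anyone reading the table for the first time: the S/O values here are consistent with SPAM% / (SPAM% + HAM%), that is, the share of a rule's hits that land on spam once both corpora are normalized to percentages. The small Python sketch below recomputes the column from the figures above; the function name is just illustrative.

```python
def s_over_o(spam_pct, ham_pct):
    """Share of a rule's hits that are spam: spam% / (spam% + ham%)."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


# (SPAM%, HAM%) pairs copied from the results table in this message.
rows = {
    "T_FVGT_M_MULTI_ODD_3": (11.8612, 0.1601),
    "T_FVGT_M_MULTI_ODD_2": (19.0278, 0.3603),
    "T_FVGT_M_MULTI_ODD_1": (30.1413, 2.0216),
}

for name, (spam, ham) in rows.items():
    print(f"{name}  S/O = {s_over_o(spam, ham):.3f}")
```

This reproduces the 0.987, 0.981 and 0.937 values shown for the three rules.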
Slightly revised rule definitions:
------- start of cut text --------------
# Frederic Tarasevicius
# Internet Information Services, Inc.
# From: "Fred I-IS.COM" <[EMAIL PROTECTED]>
# Message-ID: <[EMAIL PROTECTED]>
# Subject: Re: [SAtalk] Consonant and Vowel Pairs or Sequences
# To: <[EMAIL PROTECTED]>
# Date: Mon, 13 Oct 2003 17:13:31 -0400
body __OBFU_J /j[bcfgw]/i
body __OBFU_OTHER /(?:vj|vk|xj|xk|yy|zf|zj)/i
body __OBFU_Q0 /[jkpqtvwz]q/i
body __OBFU_Q1 /q[afhjkmnsy]/i
body __OBFU_V /[fgqw]v/i
body __OBFU_X /[cgjkqsvz]x/i
body __OBFU_Z /[fjkpqx]z/i
meta T_FVGT_M_MULTI_ODD_1 ((__OBFU_J + __OBFU_OTHER + __OBFU_Q0 + __OBFU_Q1 + __OBFU_V + __OBFU_X + __OBFU_Z) > 1)
meta T_FVGT_M_MULTI_ODD_2 ((__OBFU_J + __OBFU_OTHER + __OBFU_Q0 + __OBFU_Q1 + __OBFU_V + __OBFU_X + __OBFU_Z) > 2)
meta T_FVGT_M_MULTI_ODD_3 ((__OBFU_J + __OBFU_OTHER + __OBFU_Q0 + __OBFU_Q1 + __OBFU_V + __OBFU_X + __OBFU_Z) > 3)
------- end ----------------------------
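
One thing worth keeping in mind about these meta rules: assuming each sub-rule contributes at most 1 to the sum, the meta counts how many different pair classes appear in the body, not how many odd pairs appear in total. The Python sketch below emulates that behavior for experimenting outside SpamAssassin. The patterns are copied from the definitions above; the dict and function names are illustrative only.

```python
import re

# Sub-rule patterns copied from the rule definitions; the dict itself is
# an illustrative stand-in for SpamAssassin's rule table.
SUB_RULES = {
    "__OBFU_J": r"j[bcfgw]",
    "__OBFU_OTHER": r"(?:vj|vk|xj|xk|yy|zf|zj)",
    "__OBFU_Q0": r"[jkpqtvwz]q",
    "__OBFU_Q1": r"q[afhjkmnsy]",
    "__OBFU_V": r"[fgqw]v",
    "__OBFU_X": r"[cgjkqsvz]x",
    "__OBFU_Z": r"[fjkpqx]z",
}
COMPILED = {name: re.compile(pat, re.IGNORECASE) for name, pat in SUB_RULES.items()}


def fired_sub_rules(body):
    """Names of the sub-rules whose pattern matches somewhere in body."""
    return [name for name, rx in COMPILED.items() if rx.search(body)]


def multi_odd(body, threshold):
    """Emulate T_FVGT_M_MULTI_ODD_n: more than `threshold` sub-rules fired.

    Assumes each sub-rule counts as 0 or 1, however often it matches.
    """
    return len(fired_sub_rules(body)) > threshold
```

Under that assumption, a body that repeats "vj" many times fires only __OBFU_OTHER, so none of the three meta rules would hit on it.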
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk