Hello Charles,

Thursday, February 19, 2004, 11:32:03 AM, you wrote:

CG> Hello!

CG> I'm seeing some spam with bogus-looking 'yahoo' message-ID's.
CG> Could someone please test this rule against a nice large corpus?

I took your two suggestions,

CG> header LOC_BADYAHOOMSGID  Message-ID =~ /[EMAIL PROTECTED]/i
CG> header LOC_BADYAHOOMSGID  Message-ID =~ /[A-Z]{8,[EMAIL PROTECTED]/

And tested the following variations:

header   LOC_BADYAHOOMSGID1   Message-ID =~ /[EMAIL PROTECTED]/i
describe LOC_BADYAHOOMSGID1   From Charles Gregory <[EMAIL PROTECTED]>
score    LOC_BADYAHOOMSGID1   0.5
header   LOC_BADYAHOOMSGID2   Message-ID =~ /[A-Z]{8,[EMAIL PROTECTED]/
describe LOC_BADYAHOOMSGID2   From Charles Gregory <[EMAIL PROTECTED]>
score    LOC_BADYAHOOMSGID2   0.5
header   LOC_BADYAHOOMSGID3   Message-ID =~ /[EMAIL PROTECTED]/
describe LOC_BADYAHOOMSGID3   From Charles Gregory <[EMAIL PROTECTED]>
score    LOC_BADYAHOOMSGID3   0.5
header   LOC_BADYAHOOMSGID4   Message-ID =~ /[EMAIL PROTECTED]/
describe LOC_BADYAHOOMSGID4   From Charles Gregory <[EMAIL PROTECTED]>
score    LOC_BADYAHOOMSGID4   0.5

Rules 2 and 3 should be equivalent: the pattern is not anchored on the
left, so the "and more" comma in {8,} has no real effect here (except
maybe on performance).
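
A quick way to convince yourself of that equivalence is a small Python
sketch like the one below. The "@example.com" tail is a hypothetical
stand-in (the real patterns are redacted in this archive), and the sample
Message-IDs are made up. Only the leading [A-Z]{8,} versus [A-Z]{8} part
comes from the rules above.

```python
import re

# Hypothetical stand-ins for rules 2 and 3. The "@example.com" tail is an
# assumption for illustration; the real patterns are redacted in the archive.
OPEN_ENDED = re.compile(r"[A-Z]{8,}@example\.com")  # "8 or more" capitals
EXACT = re.compile(r"[A-Z]{8}@example\.com")        # exactly 8 capitals

# Made-up Message-ID values covering the interesting cases.
SAMPLES = [
    "<ABCDEFGH@example.com>",      # exactly 8 capitals
    "<ABCDEFGHIJKL@example.com>",  # 12 capitals
    "<ABCDEFG@example.com>",       # only 7 capitals
    "<abcdefgh@example.com>",      # lower case, no /i flag
]


def hits(pattern, msgid):
    """Return True if the unanchored pattern matches anywhere in msgid."""
    return pattern.search(msgid) is not None


def compare(samples):
    """Return (sample, open_ended_hit, exact_hit) for every sample."""
    return [(s, hits(OPEN_ENDED, s), hits(EXACT, s)) for s in samples]


if __name__ == "__main__":
    for sample, open_hit, exact_hit in compare(SAMPLES):
        print(f"{sample:32} open-ended={open_hit!s:5} exact={exact_hit}")
```

Because neither pattern is anchored on the left, the exact-count version
can simply start its match at the last 8 capitals before the "@", so both
versions hit the same messages.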

Rule 4 is rule 3 with the period in .com escaped, so that it matches only
a literal dot.
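
To illustrate what the escaped period changes, here is a minimal Python
sketch. The patterns and test strings are hypothetical (the real rules are
redacted in this archive); the point is only that an unescaped "." matches
any character, while "\." matches a literal dot.

```python
import re

# Hypothetical patterns for illustration only; not the redacted rules.
UNESCAPED = re.compile(r"@example.com")   # "." matches any single character
ESCAPED = re.compile(r"@example\.com")    # "\." matches a literal dot only


def matches(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for text in ("<ID@example.com>", "<ID@exampleXcom>"):
        print(text, matches(UNESCAPED, text), matches(ESCAPED, text))
```

In practice the looser form rarely hits anything extra, which would be
consistent with rules 3 and 4 producing identical counts on this corpus.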

Results:

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL     SPAM      HAM     S/O    RANK   SCORE  NAME
 100793    82099    18694    0.815   0.00    0.00  (all messages)
   1218     1218        0    1.000   1.00   0.50  LOC_BADYAHOOMSGID3
   1218     1218        0    1.000   1.00   0.50  LOC_BADYAHOOMSGID4
   1218     1218        0    1.000   1.00   0.50  LOC_BADYAHOOMSGID2
   1647     1639        8    0.979   0.00   0.50  LOC_BADYAHOOMSGID1

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 100793    82099    18694    0.815   0.00    0.00  (all messages)
100.000  81.4531  18.5469    0.815   0.00    0.00  (all messages as %)
  1.208   1.4836   0.0000    1.000   1.00    0.50  LOC_BADYAHOOMSGID3
  1.208   1.4836   0.0000    1.000   1.00    0.50  LOC_BADYAHOOMSGID4
  1.208   1.4836   0.0000    1.000   1.00    0.50  LOC_BADYAHOOMSGID2
  1.634   1.9964   0.0428    0.979   0.00    0.50  LOC_BADYAHOOMSGID1
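
For anyone wondering how the S/O column relates to the counts: the
figures above appear consistent with S/O = SPAM% / (SPAM% + HAM%), that
is, computed from the percentage frequencies rather than from the raw
counts. The Python sketch below is an assumption checked against this
table, not a copy of the tool's code; the function name is made up.

```python
def s_over_o(spam_hits, ham_hits, total_spam, total_ham):
    """Spam-over-overall ratio from percentage frequencies.

    Assumed formula: SPAM% / (SPAM% + HAM%), where each percentage is the
    rule's hit count divided by the size of that corpus.
    Returns None when the rule hit nothing at all.
    """
    spam_pct = 100.0 * spam_hits / total_spam
    ham_pct = 100.0 * ham_hits / total_ham
    if spam_pct + ham_pct == 0:
        return None
    return spam_pct / (spam_pct + ham_pct)


if __name__ == "__main__":
    # Corpus sizes and hit counts taken from the table above.
    TOTAL_SPAM, TOTAL_HAM = 82099, 18694
    for name, spam, ham in [
        ("LOC_BADYAHOOMSGID2", 1218, 0),
        ("LOC_BADYAHOOMSGID1", 1639, 8),
    ]:
        print(name, round(s_over_o(spam, ham, TOTAL_SPAM, TOTAL_HAM), 3))
```

For LOC_BADYAHOOMSGID1 this gives 0.979, matching the table, whereas the
raw counts alone (1639 of 1647 hits) would give about 0.995. Weighting by
percentages keeps the ratio from being skewed by the spam corpus being
much larger than the ham corpus.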

My ham corpus includes lots of email from yahoo.com webmail users and
from lots of YahooGroups mailing lists.

Bob Menschel


