-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Jack,

Saturday, July 5, 2003, 6:50:05 PM, you wrote:

>> Train bayes.  Everyone has a different bayes db, and they can't
>> work around that centrally.

JG> The problem I'm seeing is that I'm getting messages with a Bayes of
JG> 90% but it still slips through with 4.5-5.

Bayes is conservative. Of 3,629 emails I've logged with Bayes_90, only
one has been not-spam (from Enterprise Rent-a-car, a 3-line email asking
for confirmation of an email address).

I run my system conservatively also, with a required-hits of 9 instead of
the default 5. I've raised the score for Bayes_90 to 7 (the default is
either 4 or 3, depending on other testing methods), and the score for
Bayes_99 to 9 (IOW, Bayes_99 IS spam). I'm considering raising the score
for Bayes_90 to somewhere in the 7.5 to 8.0 range.

Any spam which scores under 10, and isn't already Bayes_99, gets fed back
into Bayes to be learned as spam. (Likewise, any non-spam with a score
over +1 is fed into Bayes to be learned as ham.) Bayes works wonders with
these.
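
A minimal sketch of that feedback rule in Python, for anyone who wants to script it. The function name, its parameters, and the returned labels are made up for illustration and are not part of SpamAssassin. It only decides what to do with a message I've already classified by hand; the learning itself would still be done by whatever Bayes training tool you use.

```python
def training_action(is_spam, score, tests):
    """Decide whether a hand-classified message should be fed back to Bayes.

    is_spam: the human verdict on the message (True for spam)
    score:   the total score SA assigned to the message
    tests:   set of SA test names that matched
    """
    if is_spam:
        # Spam scoring under 10 that Bayes is not already sure about.
        if score < 10 and "BAYES_99" not in tests:
            return "learn-spam"
    else:
        # Non-spam that scored over +1.
        if score > 1:
            return "learn-ham"
    # Everything else is left alone.
    return None
```

With this, a spam at 6.2 that hit only Bayes_90 comes back as "learn-spam", while the same message with Bayes_99 is skipped.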

I've also modified various other scores from the defaults, about 140 of
them. Of those 140, I've reduced the scores for 3, and raised the scores
for the rest.

Some of these increases are probably required only because of my
required-hits = 9, but others are because the distributed scores are
themselves conservative, concentrating on avoiding false positives for
EVERYONE, and I can be a bit more aggressive because I know things about
my users (eg: *none* of them are interested in porn, none are in/near
bankruptcy, etc). Most notable among those:
CONFIRMED_FORGED      -- 7.00
FAKED_UNDISC_RECIPS   -- 6.00
FORGED_AOL_RCVD       -- 5.50
FORGED_MUA_OIMO       -- 5.10
FORGED_RCVD_TRAIL     -- 7.00
FROM_OFFERS           -- 5.00
HGH                   -- 9.44
LOW_INTEREST          -- 6.23
NIGERIAN_BODY         -- 9.10
PENIS_ENLARGE         -- 5.00
PENIS_ENLARGE2        -- 5.00
RATWARE_EGROUPS       -- 9.43
RATWARE_OE_MALFORMED  -- 4.80
RCVD_FAKE_HELO_DOTCOM -- 5.50
REVERSE_AGING         -- 6.21
SUSPICIOUS_RECIPS     -- 4.00
TO_MALFORMED          -- 4.10
TO_NO_USER            -- 4.30
VIAGRA                -- 5.00
WITH_LC_SMTP          -- 6.25

(If anyone's interested, the ones I've reduced are: FROM_ENDS_IN_NUMS,
FROM_NO_LOWER, and NO_REAL_NAME.)

The way I determine these is by watching for false negatives, spam that
slips through. I see what tests were matched, determine what the scores
are, scan my own corpus to determine for myself whether these tests are
matched by non-spam, and from that determine how much to raise the
scores.

Some scores which are only suggestive of spam I don't raise at all, or
just minimally (0.1 or 0.2 max). Some scores which match only spam on my
system are increased until they're about 50% of the required-hits (being
careful), while others are increased to or over the required-hits value
(being confident).
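
Roughly, the procedure looks like the Python sketch below. It's an illustration only: the function names, the `confident` flag, and the 0.1 default nudge are assumptions, and I'm treating "only suggestive" as "also matches some non-spam in my corpus", which is a simplification. In practice I eyeball each test rather than apply a formula.

```python
from collections import Counter

def count_hits(messages):
    """Count how many messages matched each test.

    messages: iterable of collections of test names, one per message.
    """
    hits = Counter()
    for tests in messages:
        hits.update(set(tests))
    return hits

def suggest_score(test, current, spam_hits, ham_hits,
                  required_hits=9.0, confident=False, nudge=0.1):
    """Suggest a new score for one test, given corpus hit counts."""
    if ham_hits[test] > 0:
        # Also matches non-spam: raise minimally, never by more than 0.2.
        return round(current + min(nudge, 0.2), 2)
    if spam_hits[test] == 0:
        # No evidence in the corpus either way: leave the score alone.
        return current
    # Matches only spam: about 50% of required-hits when being careful,
    # the full required-hits value when being confident.
    target = required_hits if confident else required_hits * 0.5
    return max(current, target)
```

For example, with required-hits = 9, a test seen only in spam and currently scored 2.0 would be suggested at 4.5 (careful) or 9.0 (confident).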

My corpus now contains a good 8k spam, 90% of which has gone through
SA and so has test names in the headers (slightly over two months'
worth), and 10k non-spam (my personal email for this year, plus the past
month's email for other accounts I'm using SA against). My email client
allows reg-ex searches, so I'm able to simulate many SA tests without
running SA. It helps. 
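
The same kind of check can be done outside an email client. Here is a small Python sketch, assuming the messages are available as plain strings; the function name is hypothetical, and the pattern in the usage note is made up for illustration and is not the definition of any distributed SA test.

```python
import re

def match_counts(pattern, spam_texts, ham_texts):
    """Return (spam matches, non-spam matches) for one regular expression."""
    rx = re.compile(pattern, re.IGNORECASE | re.MULTILINE)
    spam = sum(1 for text in spam_texts if rx.search(text))
    ham = sum(1 for text in ham_texts if rx.search(text))
    return spam, ham
```

Calling `match_counts(r"^Subject:.*\blow(est)? interest\b", spam_texts, ham_texts)` gives the two counts to compare. A pattern that matches plenty of spam and no non-spam is a candidate for a higher score.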

Also, check out William Stearns' collected blacklist at
http://www.stearns.org/sa-blacklist/sa-blacklist.current -- it's a
marvelous resource which traps a whole lot of spam that would otherwise
sneak through.

JG> But, keep it in proportion. I'm still trapping over 98%.

Just tweaking the distributed tests, using Stearns' blacklists, and adding
a few blacklisted entries of my own (recently submitted to Stearns)
brought me to the 99% mark.

Two weeks ago I began creating my own rules to catch the rest.

This week I have received 2002 emails, 13 of which were false negatives.
That's 99.35%, and each and every one of those false negatives is now
caught by some combination of the above tweaking.
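
For anyone checking the arithmetic, here it is as a couple of lines of Python. The variable names are mine. Note that the figure is taken over all mail received this week, not over spam alone.

```python
received = 2002        # all emails received this week
false_negatives = 13   # spam that slipped through

# Share of received mail that was not a false negative.
handled = 100 * (received - false_negatives) / received
print(f"{handled:.2f}%")  # prints 99.35%
```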

I actually had one day this week with zero false negatives -- the first
day in years that no spam reached any of my email accounts. I'm looking
forward to maybe having a whole week later this summer when no spam
sneaks through.

Bob Menschel

-----BEGIN PGP SIGNATURE-----
Version: PGP 8.0

iQA/AwUBPwerIJebK8E4qh1HEQJDVACghpnYWCr3Ay2NkjYkdOmJYlfHmsAAn0Sm
Urg+IcRfpDlorCgVHB8wRIPq
=5GIv
-----END PGP SIGNATURE-----




_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
