Re[4]: Is Bayes Really Necessary?

2005-05-27 Thread Robert Menschel
Hello List,

Thursday, May 26, 2005, 11:01:23 PM, you wrote:

LMU P.S. I know the account says List Mail User, but why is this the only
LMU mailing list that almost uniformly references me that way?  Though, I do
LMU get called by the sobriquet Administrative User when I use accounts
LMU which are labeled like that.  Maybe it's just that this list's user base is
LMU ingrained in using the header label instead of the signature!?  Anyway,
LMU I kind of like the LMU :)

Don't know.  Me, I kind of like responding to the list.  :-)

LMU A quick check of the last couple of days shows 72.96% at BAYES_00
LMU and 10% at BAYES_99 and 11.29% at BAYES_50.  I suspect the results are less
LMU extreme for you, but maybe not (that would be good to hear).  Note: I have
LMU a lot of MTA level rejection, pre-filtering before SA that takes out most
LMU of the remaining spam and almost all mailing lists are set to use the
LMU bayes_ignore_to directive - so my results posted above are highly skewed
LMU by all these factors (e.g.  40% of valid email does not run through bayes,
LMU and things like nightly server reports generated internally do - I don't
LMU even trust my own firewall machines' reports).

Interesting stats.

Last month's ham (110,735):
th - 00 - 110173 = 99.5%
th - 01 - 4
th - 05 - 191
th - 20 - 164
th - 30 - 0
th - 40 - 144
th - 44 - 1
th - 50 - 6
th - 60 - 20
th - 80 - 8
th - 95 - 1
th - 99 - 23 = 0.02%

Last month's spam (79,749):
ts - 00 - 16346  = 20.5%
ts - 01 - 1
ts - 05 - 877    =  1.1%
ts - 20 - 1283   =  1.6%
ts - 30 - 2
ts - 40 - 1607   =  2.0%
ts - 44 - 8
ts - 50 - 415
ts - 60 - 3588   =  4.5%
ts - 80 - 3695   =  4.6%
ts - 95 - 2596   =  3.3%
ts - 99 - 49331  = 61.9%

Obviously Bayes does a whole lot better with ham than it does with
spam here.
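
For anyone who wants to re-check the percentages, here is a minimal Python sketch that recomputes each bucket's share from the counts posted above. The dictionary and function names (`ham`, `spam`, `shares`) are made up for illustration; only the counts come from the tables.

```python
# Bucket counts copied from the two tables above (BAYES bucket -> messages).
ham = {"00": 110173, "01": 4, "05": 191, "20": 164, "30": 0, "40": 144,
       "44": 1, "50": 6, "60": 20, "80": 8, "95": 1, "99": 23}
spam = {"00": 16346, "01": 1, "05": 877, "20": 1283, "30": 2, "40": 1607,
        "44": 8, "50": 415, "60": 3588, "80": 3695, "95": 2596, "99": 49331}


def shares(counts):
    """Return each bucket's share of the total, as a percentage."""
    total = sum(counts.values())
    return {bucket: 100.0 * n / total for bucket, n in counts.items()}


ham_pct = shares(ham)
spam_pct = shares(spam)

# Print both distributions side by side, one bucket per line.
for bucket in ham:
    print(f"BAYES_{bucket}: ham {ham_pct[bucket]:6.2f}%  spam {spam_pct[bucket]:6.2f}%")
```

The bucket counts add up to the stated totals (110,735 ham and 79,749 spam), so the percentages shown in the tables are consistent with the raw numbers.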

Many of the spam that hit BAYES_00 are outscatter. I've identified at
least 3,000 of those during the last month's work on the new obfu
rules. Now that those obfu rules are in place, I suspect those
percentages will shift nicely, but we'll probably continue to get 10%
of spam at Bayes_00.

Yes, you're right -- we do have a lot of other tricks in use here to
get them flagged as spam.   :-)

I hadn't realized that as many as 23 ham had hit BAYES_99. I would
have guessed it was only 5 or 6. We do have a lot of negative scoring
rules which pulled those down as well.  All of them were valid ham
marketing emails from the likes of United Airlines and Staples, which
are now covered by SARE's whitelist.cf.

We did have 15 FPs during this period of time, none of which will
repeat because of whitelist.cf.

Bob Menschel







Re: Re[4]: Is Bayes Really Necessary?

2005-05-27 Thread List Mail User
Bob,

The Staples mention was of interest since I get their weekly ads
to an account here.  The very last one hit BAYES_50, but all the others
ranged from BAYES_00 to BAYES_44 (the latter from a 3.0.1 install).  Most were BAYES_20
(I looked back 4 months - how long that account's mail is kept locally; I
could check archives for  10 years, but I think I've only been getting the
Staples ads for about 4 years).  All scored between .5 and 2.1 points.
I've seen a few ads from other vendors come much closer to the limit on
the accounts used (all vendor advertising intended for me goes to unique
email addresses, but they get collected by aliases in groups by industry
and use - e.g.  Staples ads don't go to the same mailbox as ads for NLOS
telecom gear).  Oddly, some of the most obscure technical items often score
the highest.

There definitely is a `style' issue at work.  It appears that some
legitimate companies and people write copy that looks like spam, while
some spammers are good at generating messages that seem to be ham to
bayes.


Paul Shupak
[EMAIL PROTECTED]

P.S.  The last Staples ad was from this Monday, May 23 and (for me) hit:
score=0.5 required=5.0 tests=AWL,BAYES_50,EXCUSE_10,
HTML_90_100,HTML_IMAGE_RATIO_04,HTML_MESSAGE,REMOVE_PAGE,
URIBL_RHS_ABUSE,URI_REDIRECTOR
I'd be curious if this was the same one that hit 99 for you (I had only
one 44 and most were 10 or 20).
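
For comparing results like these across installs, a small Python sketch along the following lines could pull the Bayes bucket out of a status line. It assumes the folded header has already been joined into one string with no whitespace inside the tests= list; `bayes_bucket` is a hypothetical helper name, not anything shipped with SpamAssassin.

```python
import re

# Status line from the P.S. above, with the folded lines joined (assumption).
status = ("score=0.5 required=5.0 tests=AWL,BAYES_50,EXCUSE_10,"
          "HTML_90_100,HTML_IMAGE_RATIO_04,HTML_MESSAGE,REMOVE_PAGE,"
          "URIBL_RHS_ABUSE,URI_REDIRECTOR")


def bayes_bucket(status_line):
    """Return the BAYES_nn rule named in the tests= list, or None if absent."""
    match = re.search(r"tests=(\S+)", status_line)
    if not match:
        return None
    for rule in match.group(1).split(","):
        # Only accept rule names of the exact form BAYES_ plus two digits.
        if re.fullmatch(r"BAYES_\d\d", rule):
            return rule
    return None


print(bayes_bucket(status))
```

Run over a mailbox's worth of status lines, the returned rule names could be tallied into the same kind of per-bucket table Bob posted.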