RE: [SAtalk] scoring system and values...

2003-11-13 Thread Larry Gilson
Thanks Chris!

Good to know someone else has a similar experience and thinks this way too!
I am experimenting with Bayes and so far, it does seem to help.  I have been
able to get rid of some custom rules that were creating FPs.

BTW, does Columbia University have any incentive to reduce spam more than
they already do by adding the complexity of Bayes for a population that
size?

--Larry






---
This SF.Net email sponsored by: ApacheCon 2003,
16-19 November in Las Vegas. Learn firsthand the latest
developments in Apache, PHP, Perl, XML, Java, MySQL,
WebDAV, and more! http://www.apachecon.com/
___
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk


RE: [SAtalk] scoring system and values...

2003-11-13 Thread Chris Santerre
I have to agree, even without Bayes! I have spammy legit mail that isn't
business related get flagged a lot. It is a real PITA because it affects my
evilrules. Same thing: users don't uncheck the "send me all sorts of junk
from your network partners" box when they sign up for something.

I've had conversations with legit mass emailers who have to fight to get
their legit opt-in-only email to customers because of antispam software.
I've been wanting to write a guide for legit emailers on how to make their
emails seem less spammy. I think legit companies need to realise there are
certain guidelines they should follow when writing legit emails. It is a
price to pay now, with all the antispam efforts.

Many really have no idea how the antispam stuff works, and what they need to
do to reduce their own FPs on sent mail. The company I had talked to was in
the medical field, and they have a hell of a time, as you can imagine.

I guess it is the whole "fish or fishing pole" moral. Teach the legit
mailers how not to hit the basic rules. That should take care of a lot of
FPs right there.

--Chris Santerre







Re: [SAtalk] scoring system and values...

2003-11-13 Thread Justin Mason

Chris Santerre writes:
I've had conversations with legit mass emailers that have to fight getting
their legit opt-in only email to customers, because of antispam software.
I've been wanting to write a guide to legit emailers on how to make their
emails seem less spammy. I think legit companies need to realise there are
certain guidelines they should follow when writing legit emails. It is a
price to pay now, with all the antispam efforts.

Like this?

  http://spam.abuse.net/marketerhelp/bulk-howto.shtml

Unfortunately, it's not well-linked.  I've just added a link to the
FAQ...

--j.





RE: [SAtalk] scoring system and values...

2003-11-12 Thread Covington, Chris
Definitely FPs.  I think SA has a very difficult time with solicited
commercial email, even with Bayes feeding.  I had to raise the threshold on
my site-wide installation to 10.0 to catch only the worst of the worst and
to stop people's solicited Priceline / Days Inn, etc. hotel confirmations
and travel/real estate deals lists from getting tagged.

And it doesn't help that Razor, DCC and Pyzor have a lot of users that
report legitimate solicited commercial email as spam (the people that
forget to uncheck "send me great offers" when they order a product from
a vendor, and then report those vendors' great offers as spam).

Maybe it's better to not use Bayes at all on a site-wide basis.  I've
noticed Columbia University doesn't use Bayes...

Chris 





RE: [SAtalk] scoring system and values... (Bayes scoring)

2003-11-12 Thread Smart,Dan
I don't use Razor or Pyzor partly for this reason, and partly due to delay
issues.

By the way...

When discussing why certain rules have certain scores, the set of scores
that make no sense to me is the scoring given to Bayes:

--50_scores.cf --
score BAYES_00 0 0 -4.901 -4.900
score BAYES_01 0 0 -0.600 -1.524
score BAYES_10 0 0 -0.734 -0.908
score BAYES_20 0 0 -0.127 -1.428
score BAYES_30 0 0 -0.349 -0.904
score BAYES_40 0 0 -0.001 -0.001
score BAYES_44 0 0 -0.001 -0.001
score BAYES_50 0 0 0.001 0.001
score BAYES_56 0 0 0.001 0.001
score BAYES_60 0 0 1.789 1.592
score BAYES_70 0 0 2.142 2.255
score BAYES_80 0 0 2.442 1.657
score BAYES_90 0 0 2.454 2.101
score BAYES_99 0 0 5.400 5.400
--

Why is BAYES_00 not = -1*BAYES_99 ?
Why would BAYES_70 score higher than BAYES_80 or BAYES_90?
Same with BAYES_20 and BAYES_10.
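
One way to check these observations mechanically is to parse the score lines and look for places where a higher Bayes probability bucket gets a lower score than the bucket below it. This is only a sketch in Python: the helper names are made up for illustration, and it assumes the four columns are SpamAssassin's four score sets, with the last two being the ones in effect when Bayes is enabled.

```python
# Stock Bayes scores exactly as quoted above (four columns per rule).
STOCK = """
score BAYES_00 0 0 -4.901 -4.900
score BAYES_01 0 0 -0.600 -1.524
score BAYES_10 0 0 -0.734 -0.908
score BAYES_20 0 0 -0.127 -1.428
score BAYES_30 0 0 -0.349 -0.904
score BAYES_40 0 0 -0.001 -0.001
score BAYES_44 0 0 -0.001 -0.001
score BAYES_50 0 0 0.001 0.001
score BAYES_56 0 0 0.001 0.001
score BAYES_60 0 0 1.789 1.592
score BAYES_70 0 0 2.142 2.255
score BAYES_80 0 0 2.442 1.657
score BAYES_90 0 0 2.454 2.101
score BAYES_99 0 0 5.400 5.400
"""


def parse_scores(text):
    """Map each rule name to its list of four per-score-set values."""
    table = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[0] == "score":
            table[parts[1]] = [float(x) for x in parts[2:]]
    return table


def inversions(table, column):
    """Return adjacent (lower_bucket, higher_bucket) pairs whose score
    drops even though the Bayes probability bucket rises."""
    names = sorted(table, key=lambda name: int(name.split("_")[1]))
    return [
        (low, high)
        for low, high in zip(names, names[1:])
        if table[high][column] < table[low][column]
    ]


scores = parse_scores(STOCK)
for column in (2, 3):
    print("column", column + 1, "inversions:", inversions(scores, column))
print("BAYES_00 + BAYES_99 =", scores["BAYES_00"][3] + scores["BAYES_99"][3])
```

On the figures quoted above, the fourth column has BAYES_10 scoring higher than BAYES_20 and BAYES_70 scoring higher than BAYES_80, and BAYES_00 plus BAYES_99 comes to about 0.5 rather than 0.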

I can only assume those who trained the Bayes filter before running the GA
trained with a bad corpus.

Also, I felt granularity should be finer as you approach 100% since it takes
a whole normal standard deviation to get from 98% to 99%, and values should
be the same on each side of 50%.

I've updated/rescored the following rules as defined below:

--local.cf --
body BAYES_01   eval:check_bayes('0.01', '0.02')
body BAYES_02   eval:check_bayes('0.02', '0.10')
body BAYES_98   eval:check_bayes('0.98', '0.99')
body BAYES_90   eval:check_bayes('0.90', '0.98')
score BAYES_00  -5.4
score BAYES_01  -4.0
score BAYES_02  -3.0
score BAYES_10  -2.5
score BAYES_80   2.5
score BAYES_90   3.0
score BAYES_98   4.0
score BAYES_99   5.4
--
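
The stated goal of these overrides is that values "should be the same on each side of 50%". A quick Python sketch can confirm the rescored table has that property. The pairing of buckets in MIRROR is my reading of the probability ranges above (for example 0.02-0.10 mirrors 0.90-0.98), not something the post spells out.

```python
import math

# Scores copied from the local.cf overrides above.
LOCAL_SCORES = {
    "BAYES_00": -5.4, "BAYES_01": -4.0, "BAYES_02": -3.0, "BAYES_10": -2.5,
    "BAYES_80": 2.5, "BAYES_90": 3.0, "BAYES_98": 4.0, "BAYES_99": 5.4,
}

# Assumed pairing: each ham-side bucket and the spam-side bucket that
# covers the mirrored probability range.
MIRROR = {
    "BAYES_00": "BAYES_99",
    "BAYES_01": "BAYES_98",
    "BAYES_02": "BAYES_90",
    "BAYES_10": "BAYES_80",
}


def is_symmetric(scores, mirror):
    """True when every ham-side score is the negative of its mirror."""
    return all(
        math.isclose(scores[ham], -scores[spam])
        for ham, spam in mirror.items()
    )


print(is_symmetric(LOCAL_SCORES, MIRROR))
```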

Dan


 





Re: [SAtalk] scoring system and values...

2003-11-11 Thread Terry Milnes
I have been considering using the bayes site wide, however I have seen a 
lot of opinions that oppose its use this way. Furthermore I did/do have 
doubts as to how well it would work.

There is no way that I can allow users to add their own mail to the 
corpus, they'll screw it up.

I guess I could start with my corpus and add to it as I do mine, and 
watch the test accounts to see how well it works.

Thanks 

David B Funk wrote:
On Sat, 8 Nov 2003, Terry Milnes wrote:


The bayes filtering works great, but the typical user is not going to
want to jump through what he would consider the huge obstacles to train
a corpus. Furthermore implementing bayes on a system that incorporates
thousands of users can be a daunting task, and isn't even an available
option to some of us.


It is true that Bayes works best if you can customize it on a per-user
basis, but that is NOT necessary. It DOES work even when left to run
on a site-wide basis with just the training from auto-learn.
As an administrator running SA with Bayes site-wide on a system that
processes tens of thousands of messages a day for thousands of users
with no per-user configs, I know of what I speak.
If you cannot do any hand-correcting (re-feed it ham/spam to correct
mistakes) you might want to adjust the scores so that a Bayes score
alone cannot be responsible for the total determination of
'spam'.  I.e., with the default spam threshold == 5 and default Bayes 100%
score == 5.4, a mistake in Bayes learning could be solely responsible
for a message being marked as 'spam'.
So crank up your spam threshold to 6 or so to require some other rules
to corroborate the Bayes assessment.
Bayes does use up a bit of memory and CPU, but it's small potatoes
compared to some of the add-in rules that have been discussed on
this list (Hi Chris ;).
So please give me one good reason why you say Bayes
"isn't even an available option to some of us".
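
The threshold arithmetic in the quoted advice can be sketched in a few lines of Python. This is a simplified model, assuming a message is tagged when the sum of its rule scores reaches the threshold. SOME_OTHER_RULE and its score of 1.0 are hypothetical stand-ins for any corroborating rule.

```python
def is_tagged(rule_scores, threshold):
    """Simplified model: tag as spam when the summed score reaches the threshold."""
    return sum(rule_scores.values()) >= threshold


bayes_only = {"BAYES_99": 5.4}
corroborated = {"BAYES_99": 5.4, "SOME_OTHER_RULE": 1.0}  # hypothetical rule

# At the default threshold of 5, a Bayes mistake is enough by itself.
print(is_tagged(bayes_only, 5.0))
# At 6, Bayes alone falls short and another rule has to agree.
print(is_tagged(bayes_only, 6.0))
print(is_tagged(corroborated, 6.0))
```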




RE: [SAtalk] scoring system and values...

2003-11-11 Thread Smart,Dan
I use Bayes for site-wide and love it.  I have a recipe in my Procmail to
grab any message that scores between 6 and 10, and store it in a suspect
MBOX.  Once a week I look through this for false positives and move to a
ConfirmedHam MBOX, and move the rest to ConfirmedSPAM MBOX.  I look at
the false positives to see why they got caught.  Since I'm sitewide, I'd
rather let through a few SPAMs than any false positives.
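
The Procmail recipe itself is not shown, so the following is not it. It is a rough Python sketch of the same idea: read the score SpamAssassin recorded on a message and decide whether it falls in the 6 to 10 band that gets a weekly review. It assumes the score appears in an X-Spam-Status header as "hits=" or "score=", which varies by version and local configuration, and the function names are made up for illustration.

```python
import re
from email import message_from_string

# Assumed header shape: "X-Spam-Status: Yes, hits=7.2 required=6.5 tests=..."
SCORE_RE = re.compile(r"\b(?:hits|score)=(-?\d+(?:\.\d+)?)")


def spam_score(raw_message):
    """Return the recorded score as a float, or None if no score is found."""
    msg = message_from_string(raw_message)
    match = SCORE_RE.search(msg.get("X-Spam-Status", ""))
    return float(match.group(1)) if match else None


def needs_review(score, low=6.0, high=10.0):
    """True when the score lies in the band set aside for manual review."""
    return score is not None and low <= score <= high


sample = (
    "X-Spam-Status: Yes, hits=7.2 required=6.5 tests=BAYES_99\n"
    "Subject: example\n"
    "\n"
    "body text\n"
)
print(spam_score(sample), needs_review(spam_score(sample)))
```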

I run Bayes, RBLs, and DCC. 

I use a threshold of 6.5, as this gives a 0.01% FP rate.

I run this on a machine with an 800MHz CPU, RH7.3, 384MB RAM. With Postfix
and John Hardin's Sanitizer script for Procmail.  I run SA through Procmail.


My daily processing stats are:
Total number of emails processed by the spam filter : 43724
Number of spams : 29792 ( 68.14%)
Number of clean messages: 13932 ( 31.86%)
Average message analysis time   :  4.08 seconds
Average spam analysis time  :  3.74 seconds
Average clean message analysis time :  4.70 seconds
Average message score   : 12.38
Average spam score  : 21.93
Average clean message score : -5.08

Never have a performance problem

My users are very sensitive to P0rn and V1agra ads, so I've jacked these
values up at the risk of catching foul real messages.  We're a company, so
that's not a problem.

As a question for the original poster, why not build a corpus and run the GA
scoring engine yourself?  That's what I'm working on now to improve my
local, real world scores.  No more guesswork.  I will know the rules with
little FP, and can crank them up.

Dan


 

| -Original Message-
| From: Terry Milnes [mailto:[EMAIL PROTECTED] 
| Sent: Tuesday, November 11, 2003 7:08 AM
| To: David B Funk
| Cc: [EMAIL PROTECTED]
| Subject: Re: [SAtalk] scoring system and values...
| 
| I have been considering using the bayes site wide, however I 
| have seen a lot of opinions that oppose its use this way. 
| Furthermore I did/do have doubts as to how well it would work.
| 




RE: [SAtalk] scoring system and values...

2003-11-11 Thread Larry Gilson
I don't know if this really fits in this subject or not.  However, I keep
thinking while reading this thread if anyone considers real opt-in
advertisements/messages that get tagged by SA (like from OshKosh,
Travelocity, Lands' End, etc.) to be a FP or not.  Do site-wide Bayes
installs have a hard time differentiating without feeding?

Thanks,
Larry





RE: [SAtalk] scoring system and values...

2003-11-11 Thread Smart,Dan
Since we are a company, I don't get too hung up over these.  I more worry
about newsletters people get for their work.  I usually just delete the
non-business stuff, and don't run them as either spam or ham.

Dan

| -Original Message-
| From: Larry Gilson [mailto:[EMAIL PROTECTED] 
| Sent: Tuesday, November 11, 2003 1:18 PM
| 
| I don't know if this really fits in this subject or not.  
| However, I keep thinking while reading this thread if anyone 
| considers real opt-in advertisements/messages that get tagged 
| by SA (like from OshKosh, Travelocity, Lands' End, etc.) to 
| be a FP or not.  Do site-wide Bayes installs have a hard time 
| differentiating without feeding?
| 




RE: [SAtalk] scoring system and values...

2003-11-08 Thread Stewart, John

Okay, THIS is a little silly for sourceforge, at least for the SA list:

[EMAIL PROTECTED]: host
mail.sourceforge.net[66.35.250.206] said: 550-This message matches a
blacklisted regular expression ([Vv] *[Ii] *[Aa] 550 *[Gg] *[Rr] *[Aa])
(in
reply to end of DATA command)

(now re-editing to remove the offending word - hopefully a 1 replacing the i
will suffice. I feel like a spammer now)
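
To see why swapping one letter for a digit gets past that kind of filter, here is a small Python sketch. It assumes the stray "550" inside the quoted pattern is an SMTP reply-code continuation from the bounce rather than part of the expression, so the pattern below leaves it out.

```python
import re

# Pattern from the bounce above, minus the "550" assumed to be SMTP residue.
BLACKLIST = re.compile(r"[Vv] *[Ii] *[Aa] *[Gg] *[Rr] *[Aa]")


def is_blocked(text):
    """True when the text contains the blacklisted word, spaced out or not."""
    return BLACKLIST.search(text) is not None


for sample in ("cheap Viagra here", "V i a g r a", "v1agra"):
    print(repr(sample), is_blocked(sample))
```

The character classes tolerate mixed case and embedded spaces, but a "1" is not in [Ii], so the edited spelling is not matched.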


 Well, I'll grant you that much although I did study it a fair
 amount. But  let's look at another aspect here too. There is not
 a single rule that scores  higher than 4.999. That is plain wrong
 in my book; let's say we encounter the word vicodin (which is
 totally absent in the current rules by the way!). 

First off, I personally don't see any spam advertising vicodin (v1agra,
yes!), and this is something that I *have* discussed via email (I broke my
toe). So it would be a huge false positive generator for me.

In regards to your rules point:

spaminator% grep BAYES_99 /usr/share/spamassassin/50_scores.cf
score BAYES_99 0 0 5.400 5.400

I think bayes is your answer (as others have suggested).

You certainly can tweak scores that you think necessary, but do be careful!
I've done this myself in the past and ended up generating more false
positives that way, from phrases you would never think would be in ham.

johnS




Re: [SAtalk] scoring system and values...

2003-11-08 Thread Terry Milnes

Now I don't expect SA to know dutch; that would be unfair. But what I would
like is some way to score those english terms way higher than an american
would or could.  For an american, "mortgage" does not spell spam per se. But
for ME it does, and I can practically guarantee I will not ever get an email
that mentions "mortgage" together with "you have been approved" which won't
be spam.
 
At the risk of being repetitive, this is precisely the sort of thing bayes
excels at.  Give it a shot (hopefully you have some ham'n'spam saved up
already), I think you will be pleased.

Well, none of this is your concern of course. But I would really really
Perhaps it's true that your success is not directly anyone's concern but
your own.  However, the regulars on this list are basically a buncha SA
users who are trying to improve their results and help others do the same
along the way.
And herein lies the problem: sure, anybody who is willing to spend time 
tweaking their personal setups, training bayes, etc. will have great 
success at filtering out spam.

Some of us though are system administrators and need a solution to offer 
to the end users.  The typical end user wants to open their email and 
see no spam, period.

Presently without the tweaks and training all we can do is reduce his 
spam by about 50 - 60%.  Settings have to be left at conservative in 
order not to get the phone calls complaining about false positives.

The bayes filtering works great, but the typical user is not going to 
want to jump through what he would consider the huge obstacles to train 
a corpus. Furthermore implementing bayes on a system that incorporates 
thousands of users can be a daunting task, and isn't even an available 
option to some of us.

Therefore when someone asks if there is a method to improve on the basic 
ruleset we should pay more attention, not just recommend he use bayes.

tm.


really like if there was a way to have those typical english spam-words score
way higher than they do now.  Could we maybe envision two rulesets, one for
english-speaking residents and one for non-english speaking residents...?
I edited the score file myself but not only is it a hard, long and error-prone
task, but by editing it I throw away much of the valuable knowhow which
assembled that score-list in the first place.  But I am faced with the fact
that over 95% of my spam is in english and that I cannot sit back while the
online pharmacies fly around me, so to speak.
Put yourself in my (our, if I'd be speaking for all non-english countries)
place and ask yourself this question: Would you accept a score of only 0.5
for a rule that says "gratis hypotheekadvies" or "vijf miljoen emailadressen"??
No, of course you wouldn't, because you'd know that a company that
pretends to sell you a mortgage from 12000 miles away will never ever be a
genuine offer...


Knowing that there are regulars on this list whose primary language is NOT
English, anyone care to share how their setup handles English and
non-English spam?








RE: [SAtalk] scoring system and values...

2003-11-08 Thread Dan Kohn
I think one of the greatest areas of confusion about SpamAssassin today
is how well Bayes can work with absolutely no training whatsoever.
Specifically, because it autotrains only on very spammy and very hammy
messages, Bayes learns quite well without any hand-selected corpuses.

The magic of SpamAssassin is that Bayes bootstraps its learning off
of the several hundred non-Bayes rules, including the use of DNS
blocklists. So, spammy messages that hit certain rules train Bayes
to find similar spam in the future, even though the future spam may not
hit those rules.  The rules are necessary to auto-train Bayes, but
Bayes is what can catch all borderline (and custom-crafted) spam going
forward.
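
The bootstrapping described above can be sketched as a toy Python model. It is not SpamAssassin's implementation: ToyBayes only counts tokens, and the cut-offs of 0.1 and 12.0 are illustrative values for "very hammy" and "very spammy" that an administrator would tune.

```python
from collections import Counter


class ToyBayes:
    """Stand-in for a Bayes database: per-class token counts only."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(set(tokens))  # count each token once per message


def autolearn(bayes, tokens, rule_score, ham_max=0.1, spam_min=12.0):
    """Train only on messages the non-Bayes rules are very sure about.

    rule_score is the message's score from the non-Bayes rules.
    Returns "spam", "ham", or None when the message is too borderline.
    """
    if rule_score >= spam_min:
        bayes.learn(tokens, is_spam=True)
        return "spam"
    if rule_score <= ham_max:
        bayes.learn(tokens, is_spam=False)
        return "ham"
    return None


db = ToyBayes()
print(autolearn(db, ["cheap", "meds", "online"], 15.0))
print(autolearn(db, ["meeting", "agenda"], -1.0))
print(autolearn(db, ["special", "offer"], 4.0))
```

Borderline messages are skipped on purpose, so a doubtful call by the rules never feeds back into the token counts.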

You still probably want to give users a way to manually create
whitelists to avoid false positives, and I recommend allowing
mistake-based training of the small number of false negatives (I use
procmail for this as described at
http://www.dankohn.com/archives/000323.htm).  

But there's no reason to think that your average user ever needs to
understand what Bayes is in order to be able to take full advantage of it.

As to performance, I can't speak to administering a Bayes machine with
thousands of users, but if 99.9+% true positives with essentially no
false negatives is important to them, then it may be worth finding a
$1000 server to just run spamd.  For me, SA is now eliminating over 200
spam a day, so email would be utterly unusable without it.

  - dan
--
Dan Kohn mailto:[EMAIL PROTECTED]
http://www.dankohn.com/  tel:+1-650-327-2600


Re: [SAtalk] scoring system and values...

2003-11-07 Thread Matt Kettler
At 10:29 AM 11/7/2003, Maarten J H van den Berg wrote:

Sorry if this has been discussed in the past...
It's been discussed many times.. It's very common for people to have a very 
deep misunderstanding of how SA scoring works. Most people fall into the 
trap of over-simplifying the problem, and simply assuming that some rule or 
another must be a good spam rule, when in fact it's not.

Of course this is open to debate, but then again that's all I want;
possibly a debate about how accurate the scoring is right now...
That's fine.. but in the next round you're going to have to do a LOT more 
homework.. you're over-simplifying things by merely looking at the name of 
the rule... You're not looking at its performance levels, its impact on 
nonspam, or its interactions with other rules.

Questioning the accuracy of the scoring system isn't unreasonable.. but the 
scoring system is VASTLY more complicated than you can understand in a few 
hours of study. You need to have a good understanding of how it really 
works, and just how complicated the balance of the scoring system is before 
you can make reasonable judgements about accuracy.

You need to realize the SA scoring system is somewhat analogous to curve 
fitting an equation with 873 variables (there are 873 rules in SA 2.60's 
50_scores.cf). This is done as an approximation using a genetic algorithm 
to evolve a solution, since a direct solution would take too long to 
compute. Trying to get your mind completely around an equation with that 
many variables is not possible for most humans, including me, but I've 
learned to understand and respect how complex the problem is.


List 1:
score ALL_CAP_PORN 0.650 0.669 0 0
score PENIS_ENLARGE2 0.500 0.590 0 0.501
score UPPERCASE_50_75 0.794 1.137 0 0
score V+AG+A_ONLINE 1.100 1.101 3.151 4.056
If it were up to me, I'd say that giving only half a point to a mail that
scores PENIS_ENLARGE2 is...  well, ludicrous.  Let's not kid ourselves.
IF there are people who participate on a genuine mailinglist that
discusses penis enlargement, let the burden fall on them to put those
addresses in their whitelist, not the reverse.
OK, being that it's not up to you, let's look at the real-world performance 
of these rules from STATISTICS.txt

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  1.010   1.5010   0.0893   0.944   0.80    0.65  ALL_CAP_PORN
  2.962   4.5216   0.0418   0.991   0.93    0.50  PENIS_ENLARGE2
  0.580   0.8552   0.0645   0.930   0.77    0.79  UPPERCASE_50_75
  1.040   1.5930   0.0032   0.998   0.95    1.10  V+AG+A_ONLINE
*yawn*.. none of these rules has particularly impressive hit rates, so they 
aren't very significant in the grand scheme of SA. A meager 4.5% of spam 
hits isn't impressive, although not useless.

Some of them, such as ALL_CAP_PORN and UPPERCASE_50_75, have really bad 
quantities of nonspam hits. Anything with an S/O under 0.90 pretty much 
doesn't deserve a high score because at least 10% of the email that the 
rule matches is nonspam. In the case of these two, both have at least 20% 
of their hits being nonspam mail.. ouch.

Quite frankly, UPPERCASE_50_75 performs so badly it doesn't even meet the 
criteria to avoid being dropped from the ruleset, but is probably retained 
for completeness with the other rules. (in general spam rules need to have 
an S/O of .80 or higher to be deemed worthwhile.. anything less isn't a 
very good indicator of spam and is just a waste of time).

In the case of the other two, you need to start looking at the larger 
ecosystem of the entire ruleset.. SA rules are not scored based on the 
merits of the rule alone.. the entire ruleset is scored together, and the 
scores of all the rules are tuned to try to get the most spam and nonspam 
placed in the proper piles.

Oftentimes the score of a rule is the result of its interaction with 
other rules. Take our PENIS_ENLARGE2 rule. This rule can quite possibly 
match some nonspam crude joke emails.. Other spam rules will likely match 
these as well, resulting in a high score.

Now, the GA is designed to treat false positives as 100 times worse than 
false negatives, so this is a very drastic situation for the GA. Faced with 
this problem, the proper thing for the GA to do is to try to reduce the 
score of the rule that affects the least amount of the spam pile.. well, 
given that PENIS_ENLARGE2 only matches 4.5% of spam, it's a good candidate 
for reduction.
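
The trade-off described above can be sketched as a weighted error count plus a very small mutate-and-keep search. This is only an illustration under stated assumptions: the corpus and the rule names BROAD, NARROW and OTHER are invented, the 5.0 threshold is assumed, and the search is a simple one-candidate evolutionary loop rather than the genetic algorithm the SA developers actually run.

```python
import random

FP_WEIGHT = 100   # one false positive counts like 100 false negatives
THRESHOLD = 5.0   # assumed tagging threshold

# Invented corpus: (set of rules hit, True if the message is spam).
CORPUS = [
    ({"BROAD"}, True),
    ({"BROAD"}, True),
    ({"BROAD", "NARROW"}, True),
    ({"NARROW", "OTHER"}, True),
    ({"NARROW", "OTHER"}, False),   # crude joke mail that looks like spam
]


def cost(scores):
    """Weighted error count for one candidate score vector."""
    total = 0
    for hits, spam in CORPUS:
        tagged = sum(scores[r] for r in hits) >= THRESHOLD
        if tagged and not spam:
            total += FP_WEIGHT      # false positive
        elif spam and not tagged:
            total += 1              # false negative
    return total


def evolve(scores, steps=2000, seed=0):
    """Mutate one score at a time, keeping changes that do not raise the cost."""
    rng = random.Random(seed)
    best, best_cost = dict(scores), cost(scores)
    for _ in range(steps):
        trial = dict(best)
        rule = rng.choice(sorted(trial))
        trial[rule] = max(0.0, trial[rule] + rng.uniform(-0.5, 0.5))
        trial_cost = cost(trial)
        if trial_cost <= best_cost:
            best, best_cost = trial, trial_cost
    return best, best_cost


start = {"BROAD": 5.0, "NARROW": 3.0, "OTHER": 3.0}
print(cost(start))                                         # 100
print(cost({"BROAD": 5.0, "NARROW": 2.0, "OTHER": 2.0}))   # 1
tuned, tuned_cost = evolve(start)
print(tuned_cost)
```

In this toy corpus the joke mail and one spam hit exactly the same rules, so no score vector can separate them. Giving up that one spam (cost 1) is far cheaper than tagging the joke mail (cost 100), which is why the rules with the narrowest spam coverage end up with reduced scores.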











---
This SF.Net email sponsored by: ApacheCon 2003,
16-19 November in Las Vegas. Learn firsthand the latest
developments in Apache, PHP, Perl, XML, Java, MySQL,
WebDAV, and more! http://www.apachecon.com/
___
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk


RE: [SAtalk] scoring system and values...

2003-11-07 Thread Scott Sprunger
Wow!  Matt, this is an incredibly informative post.  Thank you!

-----Original Message-----
From: Matt Kettler [mailto:[EMAIL PROTECTED]
Sent: Friday, November 07, 2003 12:43 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [SAtalk] scoring system and values...


At 10:29 AM 11/7/2003, Maarten J H van den Berg wrote:

Sorry if this has been discussed in the past...

It's been discussed many times.. It's very common for people to have a very 
deep misunderstanding of how SA scoring works. Most people fall into the 
trap of over-simplifying the problem, and simply assuming that some rule or 
another must be a good spam rule, when in fact it's not.

Of course this is open to debate, but then again that's all I want;
possibly a debate about how accurate the scoring is right now...

That's fine.. but in the next round you're going to have to do a LOT more 
homework.. you're over-simplifying things by merely looking at the name of 
the rule... You're not looking at its performance levels, its impact on 
nonspam, or its interactions with other rules.


Re: [SAtalk] scoring system and values...

2003-11-07 Thread maarten van den Berg
On Friday 07 November 2003 18:43, Matt Kettler wrote:
 At 10:29 AM 11/7/2003, Maarten J H van den Berg wrote:
 Sorry if this has been discussed in the past...

 It's been discussed many times.. It's very common for people to have a very
 deep misunderstanding of how SA scoring works. Most people fall into the
 trap of over-simplifying the problem, and simply assuming that some rule or
 another must be a good spam rule, when in fact it's not.

Well...  I do of course understand that filtering spam entails more than 
kill_anything_suspicious  ;-)  The problem is deepened by the rule that 
above all you want no false positives, which is very good in and of itself.
The fact that you 'train' spamassassin by looking at a LOT of ham and spam and 
derive your rulesets from that is also (very!) commendable. 
But put yourself in my place. Upon looking at those rules I see a LOT of 
inconsistencies. For instance, I found these rules that have a score of zero(!) 
(and these are merely the tip of a large iceberg)

score CASHCASHCASH 0
score ADDRESSES_ON_CD 0
score BLANK_LINES_90_100 0
score EJACULATION 0
score HERBAL_V+AG+A 0

One could argue that yelling CASH CASH CASH is a valid sales pitch in a normal 
mail. But hey, are we being realistic here ?  How could anything but spam 
have this property ?  For addresses_on_cd one could argue that it IS possible 
to have such a statement in a regular email (albeit that's already stretching 
it) but then I would retort that although possible it would stand to reason 
to give it at LEAST a score of 0.5 or so, but not _zero_!  And the third, 
well, it could be a misconfigured client, but still, is an email that is 90% 
thin air worthy of being treated as a valid email?  And the fourth...  of 
course you will find ejaculation in many many forums but, again, give it at 
least some low figure but NOT equal to zero...
And...  well I won't even go into the fifth rule... come on ;-)

 Of course this is open to debate, but then again that's all I want;
 possibly a debate about how accurate the scoring is right now...

 That's fine.. but in the next round you're going to have to do a LOT more
 homework.. you're over-simplifying things by merely looking at the name of
 the rule... You're not looking at its performance levels, its impact on
 nonspam, or its interactions with other rules.

You do not know how much I crosschecked. But I have to admit I'm new to this 
list so yeah, I do understand your criticism.  But I have looked up a lot of 
those rules, just to be sure what they check on _exactly_.

Besides, I WANT to learn, so if you can point me to older discussions about 
this I would definitely appreciate that. (maybe the approximate month, or a 
subject to look for...) I just haven't been able to find it yet.

 Questioning the accuracy of the scoring system isn't unreasonable.. but the
 scoring system is VASTLY more complicated than you can understand in a few
 hours of study. You need to have a good understanding of how it really
 works, and just how complicated the balance of the scoring system is before
 you can make reasonable judgements about accuracy.

Well, I'll grant you that much although I did study it a fair amount. But 
let's look at another aspect here too. There is not a single rule that scores 
higher than 4.999. That is plain wrong in my book; let's say we encounter the 
word vicodin (which is totally absent in the current rules by the way!). 
I would then say let's score that 5.50 immediately and IF it is a regular 
email it must 'prove' that fact by having 'positive' points like known_mua or 
what have you. I'd say let the burden be on the one guy that IS discussing 
vicodin and let him have those addresses whitelisted...   That might be a 
bold statement but let's be realistic here: there is a WAR going on guys...
Giving vicodin the benefit of the doubt is, well, VERY doubtful at best...!

 You need to realize the SA scoring system is somewhat analogous to curve
 fitting an equation with 873 variables (there are 873 rules in SA 2.60's
 50_scores.cf). This is done as an approximation using a genetic algorithm
 to evolve a solution, since a direct solution would take too long to
 compute. Trying to get your mind completely around an equation with that
 many variables is not possible for most humans, including me, but I've
 learned to understand and respect how complex the problem is.

Hum. Okay.  But keep in mind I DO NOT question 95% of the rules. Only, some 
just stick out like a sore thumb. Like the Nigerian spam thingy. Or, better, 
one I discovered during testing: the word v1agr4 (had to spell it this way 
for this list but I mean in the correct spelling here) in the body text is 
not recognized or tagged. Only if it is spelled with a capital V does it get 
tagged. That is not really okay, is it?

 List 1:
 score ALL_CAP_PORN 0.650 0.669 0 0
 score PENIS_ENLARGE2 0.500 0.590 0 0.501
 score UPPERCASE_50_75 0.794 1.137 0 0
 score V+AG+A_ONLINE 1.100 1.101 3.151 

Re: [SAtalk] scoring system and values...

2003-11-07 Thread Chris Thielen
Hiya Maarten!

This is going a bit off topic, but the spam I receive is 90% porn.  I
haven't had one slip through in months.  The secret?  Adding Bayesian and
network checks into the mix.  Granted, I'm the only user on my system as
well as the admin, but I've seen 100% accuracy for quite some time now.

Your frustration is understood and shared.  I've had similar thoughts when
spam hit my inbox at 4.3 points (@#$L!!K, why didn't this rule score
higher?), but I've forgotten about them now that Bayes and the network
tests are running so well.  Even those one-line spams or image spams are
getting caught.
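
Turning on Bayes and network checks also changes which score applies. The four numbers on the `score` lines quoted earlier in this thread are the scores for four configurations: neither Bayes nor network tests, network tests only, Bayes only, and both. A minimal sketch of picking the active one follows; the helper names `parse_score_line` and `active_score` are made up for illustration and are not part of SpamAssassin.

```python
def parse_score_line(line):
    """Return (rule_name, [s0, s1, s2, s3]) from a 'score NAME ...' line."""
    parts = line.split()
    if len(parts) < 3 or parts[0] != "score":
        raise ValueError(f"not a score line: {line!r}")
    name = parts[1]
    values = [float(v) for v in parts[2:]]
    if len(values) == 1:
        values = values * 4          # a single score applies to all four sets
    if len(values) != 4:
        raise ValueError(f"expected 1 or 4 scores: {line!r}")
    return name, values


def active_score(values, use_bayes, use_net):
    """Score set index: 0 = neither, 1 = net only, 2 = Bayes only, 3 = both."""
    return values[(2 if use_bayes else 0) + (1 if use_net else 0)]


name, values = parse_score_line("score V+AG+A_ONLINE 1.100 1.101 3.151 4.056")
print(name, active_score(values, use_bayes=False, use_net=False))  # 1.1
print(name, active_score(values, use_bayes=True, use_net=True))    # 4.056
```

So the same rule that looks weak at 1.1 in a plain install is worth 4.056 once Bayes and network tests are both enabled.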


Steering back slightly, SA's scores are weighted with a very clear goal:
maximize spam caught; minimize false positives; do this for the majority
of users.  The method used for scoring rules is indeed very real world: SA
users run the rules on their own, private, real-life corpora.  They then
submit the results of the latest rules back to SA.  Then the collective
results are analyzed and appropriate scores computed... scores that have
been optimized for the real-life SA users' real-life corpora.

Finally, if (since) SA does not suit your needs out of the box, it is
designed to be very easily customizable.  It's trivial to change the rules
and tweak the scores to best handle the spam your site and/or users tend
to receive.


I'm not criticizing your viewpoint, but with a little tweaking, SA really
works for me!

--
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/



--snip--
 Nice to exchange thoughts about this though. :-)

 Kind regards,
 Maarten





Re: [SAtalk] scoring system and values...

2003-11-07 Thread Justin Mason

Chris Thielen writes:
I'm not criticizing your viewpoint, but with a little tweaking, SA really
works for me!

Yeah -- and that's the important thing.  It's really very easy to train
the Bayes stuff, in particular.

The aim is to:

  - work well for *everyone* by default (the default scores)

  - be easily trained for higher accuracy for people who are willing to
    spend a little time training it (Bayes)

--j.





RE: [SAtalk] scoring system and values...

2003-11-07 Thread Tom Meunier
The CASHCASHCASH rule tests for the string '$$$', not for the phrase
"CASH! CASH! CASH!"
The ADDRESSES_ON_CD rule caught almost as much ham when tested against a
half-million message corpus as it did spam.
The BLANK_LINES_90_100 caught MORE ham than it did spam.

http://search.cpan.org/src/JMASON/Mail-SpamAssassin-2.60/rules/STATISTICS-set1.txt

The reality is that you THINK these should be higher, but they're not as
indicative of spam as you THINK they are.  This has been empirically
tested with a statistically significant sample.  Click the link above
and you'll see the results of the testing on that corpus.


I think that since you work in an environment that does not tolerate any
mention of the word v?a?ra you should score these rules higher in your
local.cf file.  That's the beauty of being able to simply put 
"score ADDRESSES_ON_CD 97.0" in your own config files.

-tom



Re: [SAtalk] scoring system and values...

2003-11-07 Thread maarten van den Berg

Following up to myself, since I want to clarify something here...

Another aspect that is relevant to me (but arguably not to most users of SA, 
and I'm aware of that...) is that English is not my native language, 
nor am I a resident of an English-speaking country. Because of this, 
my email is mixed; one part is Dutch, the other part is all the mailing lists 
I try to follow (which are in English).  But not being a resident, the fact is 
that for all of my customers and myself, ANY mail mentioning mortgages, loans, 
ejaculation et al is a surefire sign of spam.  If not it would have mentioned 
hypotheken, leningen and klaarkomen, which are the Dutch translations. :-)

Now I don't expect SA to know Dutch; that would be unfair. But what I would 
like is some way to score those English terms way higher than an American 
would or could.  For an American, "mortgage" does not spell spam per se. But 
for ME it does, and I can practically guarantee I will not ever get an email 
that mentions "mortgage" together with "you have been approved" which won't 
be spam.
   
Well, none of this is your concern of course. But I would really really really 
like it if there were a way to have those typical English spam-words score way 
higher than they do now.  Could we maybe envision two rulesets, one for 
English-speaking residents and one for non-English-speaking residents...?
I edited the score file myself but not only is it a hard, long and error-prone 
task, but by editing it I throw away much of the valuable know-how which 
assembled that score-list in the first place.  But I am faced with the fact 
that over 95% of my spam is in English and that I cannot sit back while the 
online pharmacies fly around me, so to speak.  
Put yourself in my (our, if I may speak for all non-English countries) 
place and ask yourself this question: Would you accept a score of only 0.5 
for a rule that says "gratis hypotheekadvies" (free mortgage advice) or 
"vijf miljoen emailadressen" (five million email addresses)?  No, of course 
you wouldn't, because you'd know that a company that pretends to sell you a 
mortgage from 12000 miles away will never ever be a genuine offer...

In other words, a lot of us get bitten by the fact that "mortgage" in some 
countries, in some contexts, can be non-spam but for the rest of us it is a 
surefire sign of spam.  And again that is not anyone's fault but we should 
try and make SA flexible enough to accommodate this fact by changing the 
scoring.  I know you can teach SA to recognize spam in one's own language, 
but what is missing right now is a simple way to make SA much more immune to 
the abundant English spam, which arguably is by FAR the bulk of all spam...

Kind regards,
Maarten


On Friday 07 November 2003 22:21, maarten van den Berg wrote:
 On Friday 07 November 2003 18:43, Matt Kettler wrote:
  At 10:29 AM 11/7/2003, Maarten J H van den Berg wrote:
  Sorry if this has been discussed in the past...
 
  It's been discussed many times.. It's very common for people to have a
  very deep misunderstanding of how SA scoring works. Most people fall into
  the trap of over-simplifying the problem, and simply assuming that some
  rule or another must be a good spam rule, when in fact it's not.

snip

-- 
Yes of course I'm sure it's the red cable. I guarante[^%!/+)F#0c|'NO CARRIER




Re: [SAtalk] scoring system and values...

2003-11-07 Thread Chris Thielen
maarten van den Berg said:

 Following up to myself, since I want to clarify something here...

 Another aspect that is relevant to me (but arguably not to most users of
 SA, and I'm aware of that...) is that English is not my native language,
 nor am I a resident of an English-speaking country. Because of this, my
 email is mixed; one part is Dutch, the other part is all the mailing lists
 I try to follow (which are in English).  But not being a resident, the fact
 is that for all of my customers and myself, ANY mail mentioning mortgages,
 loans, ejaculation et al is a surefire sign of spam.  If not it would have
 mentioned hypotheken, leningen and klaarkomen, which are the Dutch
 translations. :-)

 Now I don't expect SA to know Dutch; that would be unfair. But what I
 would like is some way to score those English terms way higher than an
 American would or could.  For an American, "mortgage" does not spell spam
 per se. But for ME it does, and I can practically guarantee I will not ever
 get an email that mentions "mortgage" together with "you have been
 approved" which won't be spam.

At the risk of being repetitive, this is precisely the sort of thing Bayes
excels at.  Give it a shot (hopefully you have some ham'n'spam saved up
already); I think you will be pleased.



 Well, none of this is your concern of course. But I would really really

Perhaps it's true that your success is not directly anyone's concern but
your own.  However, the regulars on this list are basically a buncha SA
users who are trying to improve their results and help others do the same
along the way.

 really like it if there were a way to have those typical English spam-words
 score way higher than they do now.  Could we maybe envision two rulesets,
 one for English-speaking residents and one for non-English-speaking
 residents...?
 I edited the score file myself but not only is it a hard, long and
 error-prone task, but by editing it I throw away much of the valuable
 know-how which assembled that score-list in the first place.  But I am
 faced with the fact that over 95% of my spam is in English and that I
 cannot sit back while the online pharmacies fly around me, so to speak.
 Put yourself in my (our, if I may speak for all non-English countries)
 place and ask yourself this question: Would you accept a score of only 0.5
 for a rule that says "gratis hypotheekadvies" (free mortgage advice) or
 "vijf miljoen emailadressen" (five million email addresses)?  No, of course
 you wouldn't, because you'd know that a company that pretends to sell you a
 mortgage from 12000 miles away will never ever be a genuine offer...


Knowing that there are regulars on this list whose primary language is NOT
English, would anyone care to share how their setup handles English and
non-English spam?





--
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/




Re: [SAtalk] scoring system and values...

2003-11-07 Thread Keith C. Ivey
maarten van den Berg [EMAIL PROTECTED] wrote:

  I know you can teach SA to recognize spam in one's own
 language, but what is missing right now is a simple way to make
 SA much more immune to the abundant English spam, which arguably
 is by FAR the bulk of all spam...

There is a way: the Bayesian analysis.  If "mortgage" never 
appears in nonspam and often appears in spam, then messages 
containing the word will very quickly start getting BAYES_99.
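
To make the idea concrete, here is a deliberately simplified per-token estimate. It is not SpamAssassin's Bayes implementation, which combines many tokens and uses a different formula; the training messages and the 0.01/0.99 clamp are assumptions chosen only to show how a word seen solely in spam drifts toward the top of the range.

```python
def token_spam_probability(token, spam_msgs, ham_msgs):
    """Estimate P(spam | token) from how often the token appears in each pile."""
    spam_rate = sum(token in m for m in spam_msgs) / len(spam_msgs)
    ham_rate = sum(token in m for m in ham_msgs) / len(ham_msgs)
    if spam_rate + ham_rate == 0:
        return 0.5                      # unseen token: no evidence either way
    p = spam_rate / (spam_rate + ham_rate)
    return min(0.99, max(0.01, p))      # clamp so one token is never absolute


# Invented training data: each message is a set of lowercase words.
spam = [{"mortgage", "approved"}, {"mortgage", "rates"}, {"cheap", "pills"}]
ham = [{"hypotheek", "offerte"}, {"mailinglist", "patch"}, {"rates", "meeting"}]

print(token_spam_probability("mortgage", spam, ham))   # 0.99
print(token_spam_probability("rates", spam, ham))      # 0.5
print(token_spam_probability("hypotheek", spam, ham))  # 0.01
```

For a mailbox where "mortgage" only ever shows up in spam, the per-user training data does the job that a hand-raised score would otherwise have to do.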

-- 
Keith C. Ivey [EMAIL PROTECTED]
Washington, DC





Re: [SAtalk] scoring system and values...

2003-11-07 Thread Matt Kettler
At 04:25 PM 11/7/2003, maarten van den Berg wrote:
Upon looking at those rules I see a LOT of
inconsistencies. For instance, I found these rules that have a score of zero(!)
(and these are merely the tip of a large iceberg)

score CASHCASHCASH 0
score ADDRESSES_ON_CD 0
score BLANK_LINES_90_100 0
score EJACULATION 0
score HERBAL_V+AG+A 0

Not to be rude, but did you even READ my email?

You're still not even thinking about looking at STATISTICS.txt.. you're 
still looking at the rule name, and not even glancing at how it performs or 
works.

Until you break yourself of merely looking at the rule name and making 
reaction based only on the name, you're going to be stuck in the rut of 
making bad assumptions and not understanding why.

For god[s?] sake, one of the rules in the above set has an S/O of 0.529.. 
are you joking by suggesting that this rule is a clear indicator of spam??? 
About 47% of emails this rule matches are nonspam emails. It's not a 
contradiction for it to have a score of 0.. it SHOULD have a score of 0 in 
the most clear way possible.
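
The arithmetic behind an S/O figure can be checked by hand. The sketch below assumes S/O is SPAM% divided by (SPAM% + HAM%); that assumption reproduces the S/O column of the STATISTICS.txt rows quoted earlier in this thread, and one minus S/O then gives the nonspam share of a rule's hits.

```python
def s_o_ratio(spam_pct, ham_pct):
    """Share of a rule's hits that are spam, from its per-corpus hit rates."""
    return spam_pct / (spam_pct + ham_pct)


# (SPAM%, HAM%) pairs copied from the STATISTICS.txt rows quoted in the thread.
ROWS = {
    "ALL_CAP_PORN": (1.5010, 0.0893),
    "PENIS_ENLARGE2": (4.5216, 0.0418),
    "UPPERCASE_50_75": (0.8552, 0.0645),
    "V+AG+A_ONLINE": (1.5930, 0.0032),
}

for name, (spam_pct, ham_pct) in ROWS.items():
    print(f"{name:16s} S/O = {s_o_ratio(spam_pct, ham_pct):.3f}")

# A rule with S/O 0.529 has roughly this share of nonspam among its hits:
print(round(1 - 0.529, 3))   # 0.471
```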

Do yourself a favor.. look at the STATISTICS.txt file.. it comes with 
spamassassin.. it can explain a lot.

Yes there are a lot of rule scores that seem funny at first glance.. but 
unless you're going to even so much as try to understand what's going on, 
there's not much point in me trying to explain things to you. You're still 
making the same mistakes and not learning anything.





Re: [SAtalk] scoring system and values...

2003-11-07 Thread Keith C. Ivey
maarten van den Berg [EMAIL PROTECTED] wrote:

 There is not a single rule that scores 
 higher than 4.999. That is plain wrong in my book; let's say we encounter the 
 word vicodin (which is totally absent in the current rules by the way!). 
 I would then say let's score that 5.50 immediately and IF it is a regular 
 email it must 'prove' that fact by having 'positive' points like known_mua or 
 what have you.

There are few rules with negative scores in the default set, 
with good reason -- it's easy for spammers to start using them. 
That's why the MUA tests have essentially disappeared.

If you want to score vicodin as 5.5, you have that ability, 
but it seems better for the default values to be based on 
methodical analysis of actual mail rather than on your personal 
guesses about what words are reasonable to have in spam and 
nonspam mail.

I have a custom rule for vicodin and other drug names, but I 
haven't scored it 5.5.  It is rare for spam to trigger only one 
rule, so a few points are enough.

-- 
Keith C. Ivey [EMAIL PROTECTED]
Washington, DC





Re: [SAtalk] scoring system and values...

2003-11-07 Thread Keith C. Ivey
Maarten J H van den Berg [EMAIL PROTECTED] wrote:

 List 1:
 score ALL_CAP_PORN 0.650 0.669 0 0
 score PENIS_ENLARGE2 0.500 0.590 0 0.501
 score UPPERCASE_50_75 0.794 1.137 0 0
 score V+AG+A_ONLINE 1.100 1.101 3.151 4.056
 
 If it were up to me, I'd say that giving only half a point to a mail that 
 scores PENIS_ENLARGE2 is...  well, ludicrous.  Let's not kid ourselves. 
 IF there are people who participate on a genuine mailing list that 
 discusses penis enlargement, let the burden fall on them to put those 
 addresses in their whitelist, not the reverse.

Have you looked carefully at those tests, rather than just 
reading their names and making assumptions?

ALL_CAP_PORN, for example, matches "SUMMA CUM LAUDE" or
"J. ANAL. CHEM."

Messages sent by one of my users have triggered PENIS_ENLARGE 
and PENIS_ENLARGE2.  The offending phrases were these:

According to one study, tobacco use within five years of a
woman beginning her period appears to increase her risk of
developing breast cancer 

I tell patients who are concerned about breast cancer to
get an annual mammogram, increase their physical activity
and, if they smoke, quit smoking.

Breast cancer patients treated with lumpectomy and
radiation survive longer if they don't smoke.

And I've had several recent nonspam messages that triggered 
UPPERCASE rules.  One had a tab-separated text file attached 
containing a table from a database in which all the text was 
uppercase.

Rules almost always match messages you didn't intend them to 
match.  That's one reason why it's almost always a bad idea to 
assign a large score to any single rule.
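
A small sketch of how that happens. The pattern below is hypothetical and is NOT the real ALL_CAP_PORN rule; it only shows how an innocent-looking uppercase word match fires on ordinary academic text, and how case-sensitive matching misses the lowercase form.

```python
import re

# Hypothetical pattern for illustration, not the real ALL_CAP_PORN rule.
NAIVE_CAPS = re.compile(r"\b(?:CUM|ANAL)\b")


def hits(text, pattern=NAIVE_CAPS):
    """Return the list of substrings the pattern matched."""
    return pattern.findall(text)


print(hits("Graduated SUMMA CUM LAUDE in 1998"))     # ['CUM']
print(hits("Published in J. ANAL. CHEM. vol. 12"))   # ['ANAL']
print(hits("summa cum laude"))                       # []
```

Both nonspam phrases trip the pattern, so a large score on a rule like this would tag a diploma announcement or a journal citation on its own.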

-- 
Keith C. Ivey [EMAIL PROTECTED]
Washington, DC


