Re: Bayes Questions

2005-07-11 Thread Daniel J. Cody

Andrew,

Andrew Ott wrote:

> Also, is there any way to see the count of spam and ham messages that are in
> the bayes database? I can't seem to find any info on that.  I want to make
> sure there are a lot in there before I turn the bayes rules on.


If you run spamassassin --lint -D you should see a line that says 
something like:


debug: bayes corpus size: nspam = 1, nham = 5000

nspam is the number of spam messages, nham is the number of hams it has 
learned.
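
If you want to check those numbers from a script, here is a minimal Python sketch. It assumes the debug line has exactly the format shown above (other SpamAssassin versions may word it differently), and the name parse_corpus_size is only for illustration:

```python
import re

# Matches the corpus-size line in the debug output quoted above.
CORPUS_RE = re.compile(r"bayes corpus size: nspam = (\d+), nham = (\d+)")


def parse_corpus_size(debug_output):
    """Return (nspam, nham) from debug text, or None if the line is absent."""
    match = CORPUS_RE.search(debug_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "debug: bayes corpus size: nspam = 1, nham = 5000"
counts = parse_corpus_size(sample)
```

You would feed it the captured output of the lint run and compare both counts against whatever minimum you are comfortable with.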


HTH

Dan



Bayes Questions

2005-07-11 Thread Andrew Ott
For those of you running large sites (we have about 12,000 users, with
210,000 messages a day), what do you have for bayes_expiry_max_db_size?

Also, is there any way to see the count of spam and ham messages that are in
the bayes database? I can't seem to find any info on that.  I want to make
sure there are a lot in there before I turn the bayes rules on.

Thank you.

Andrew



Re: simultaneous sa-learn processes

2005-07-11 Thread Robert Menschel
Hello Chavdar,

Monday, July 11, 2005, 3:40:14 AM, you wrote:

CV> Hi List,

CV> Our mail server serves about 100 users. Our config:
CV> Sendmail+Procmail+SpamAssassin.
CV> The question is:
CV> If I got it right, we should run sa-learn for each user in order to benefit
CV> from bayes. We intend to run a cron job for each user and do it at night by
CV> supplying a daily snapshot of our spam and ham collections to sa-learn.
CV> Can our mail server handle it (256 MB RAM, Celeron 400 MHz)?
CV> A weekly collection run for 1 user usually eats 100% of CPU load. My concern
CV> is whether the system is going to crash or just do the job slower and if you
CV> can point out how many sa-learn tasks we could run simultaneously with our
CV> setup.
CV> All hints will be appreciated, for we scheduled an initial load for 16 users
CV> of the big collection of spam received so far.

As indicated in another email, doing a user-level learn of system-wide
collected ham/spam doesn't make much sense.  And if you take your
current system-wide collection and sa-learn it 100 times, you'll use
100 times more resources than learning it once.

On the other hand, if you meant that you'd sa-learn each individual
user's ham/spam for that user only, then move to the next, then
provided you do these one after the other sequentially (not all 100 at
once), you should not increase your system load at all.  (You will
increase your disk storage, since each user's database will take up
some disk space.)

As discussed in a couple of Bugzilla entries, you should probably
limit the size of your sa-learn runs -- limit them to a few hundred
emails at a time, or maybe a few meg combined size. A massive sa-learn
run of thousands of emails, dozens of meg in one run, can bring a
resource-limited system to its knees.
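
One way to keep each run small is to split the collection into batches before learning. The following Python sketch is only an illustration: the limits (300 messages, 5 MB) are arbitrary stand-ins for "a few hundred emails" and "a few meg", and batch_messages is a made-up helper, not part of SpamAssassin:

```python
def batch_messages(sized_paths, max_count=300, max_bytes=5 * 1024 * 1024):
    """Yield lists of paths, each list within both limits.

    sized_paths is an iterable of (path, size_in_bytes) pairs. A single
    message larger than max_bytes still gets a batch of its own.
    """
    batch, batch_bytes = [], 0
    for path, size in sized_paths:
        too_many = len(batch) >= max_count
        too_big = bool(batch) and batch_bytes + size > max_bytes
        if too_many or too_big:
            # Close the current batch before it would exceed a limit.
            yield batch
            batch, batch_bytes = [], 0
        batch.append(path)
        batch_bytes += size
    if batch:
        yield batch
```

Each batch can then be handed to one sa-learn run (for example sa-learn --spam followed by the file names), one batch after another rather than all at once.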

Bob Menschel





Re: Fedora changed SpamAssassin default level to 7?

2005-07-11 Thread Kelson

Justin Mason wrote:

> fyi, if you're using Fedora Core --
> http://blog.dave.org.uk/archives/000715.html
>
> totally unconfirmed, but worth noting in case that really is the
> case.


My copy of Fedora Core 4 has "required_hits 5" in local.cf using the 
distribution's RPM for SpamAssassin.  rpm -Va made no complaints about 
the file.  Just to be sure, I uninstalled it, checked that local.cf was 
gone, and reinstalled it via yum.  Standard defaults.


It looks to me like something other than Fedora Core was messing with 
his config.


--
Kelson Vibber
SpeedGate Communications 


Re: Bypass URI check

2005-07-11 Thread Daryl C. W. O'Shea

[EMAIL PROTECTED] wrote:

> Hi All,
>
> I have received a few messages like the following.  This asks the
> receiver to copy and paste the link into their web browser.  Since the
> href is missing, there is no URI check.  That sucks, because the URIBL
> is my best friend right now (love black).  We are close to marking it
> and URIBL would have definitely got it over.  Any ideas on handling
> this?


SpamAssassin 3.1.0 will catch these.  Depending on your environment you 
could consider running 3.1.0-pre3.


Daryl



Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kelson

Joe Flowers wrote:
> BTW, if anyone knows a command line program that can easily run through a
> bunch of mbox files and tell how many messages are in them, I will
> report back how many ham and how many spam messages that I have fed to
> bayes. It's far from perfect, but it may offer some interesting info
> regarding the 100:1 (fn:fp) ratio.


I usually do this:

grep -c "^From " filename

It's not perfect, since it's theoretically possible for someone to have 
a line in their message that starts with
From (to provide an example -- see if your mbox-generating program 
escapes that line!), but it's usually enough.
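
For a whole bunch of mbox files at once, the same rule can be sketched in Python. This is only an illustration of the grep approach above; the function names are made up, and it has the same weakness with unescaped body lines:

```python
def count_from_lines(lines):
    """Count lines starting with 'From ' (same rule as grep -c '^From ')."""
    return sum(1 for line in lines if line.startswith("From "))


def count_mbox_files(paths):
    """Return {path: message_count} for each mbox file in paths."""
    counts = {}
    for path in paths:
        # latin-1 can decode any byte, so odd encodings in bodies don't raise.
        with open(path, encoding="latin-1") as handle:
            counts[path] = count_from_lines(handle)
    return counts
```

Note that "From:" header lines are not counted, since the rule requires a space right after "From".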


--
Kelson Vibber
SpeedGate Communications 


Fedora changed SpamAssassin default level to 7?

2005-07-11 Thread Justin Mason
fyi, if you're using Fedora Core --
http://blog.dave.org.uk/archives/000715.html

totally unconfirmed, but worth noting in case that really is the
case.

--j.


Re: (repost) bayes_ignore_from with wildcard ?

2005-07-11 Thread Daryl C. W. O'Shea

Matt Kettler wrote:
> Although by looking at _check_whitelist, I wonder if it works the way
> the docs say. The docs claim it's file glob and not regex, but
> _check_whitelist looks a lot like it does a regex.


_check_whitelist does use a regexp to do the matching but the config 
parser (add_to_addrlist() and add_to_addrlist_rcvd()) only passes file 
glob style expressions.  Any other regexp style metacharacters are escaped.
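
To illustrate the idea (this is not SpamAssassin's actual Perl code, only a Python sketch of the same technique with a made-up function name), a glob can be turned into a regex by translating * and ? and escaping everything else:

```python
import re


def glob_to_regex(pattern):
    """Translate a '*' / '?' glob into an anchored, case-insensitive regex."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")
        elif char == "?":
            parts.append(".")
        else:
            # Everything else, including regex metacharacters, stays literal.
            parts.append(re.escape(char))
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)
```

So the matching is done by a regex engine, but a user can never smuggle in regex syntax such as "+" or "." through the config line.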


Daryl



Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
> BTW, if anyone knows a command line program that can easily run through a
> bunch of mbox files and tell how many messages are in them, I will report
> back how many ham and how many spam messages that I have fed to bayes.

Well, I thought this might give some good stats on the FP:FN ratio, but 
I forgot I manually fed Bayes at the very beginning of the SA 3.0.2 
install to get it kick-started immediately. So, counting those messages 
won't give anything accurate :( Initially, I thought I was feeding Bayes 
just the FPs and FNs, but I forgot about the initial feeding.





Help debugging spamc/spamd

2005-07-11 Thread email builder
Hi,

  We recently changed some of our network topology so that we are temporarily
connecting with spamc to spamd over a regular external network connection (we
usually keep it inside our LAN, but this is a temporary thing... don't ask).

  Unfortunately, spamd stops (mostly) responding it seems.  I can watch spamc
sitting and waiting on the MTA by using "ps ax | grep spam" but I don't see
anything happening on the spamd server except for once every 15 minutes or
so, a few messages will process (there are hundreds a minute to process). 
I'm not sure where it would be choking.

  I ran spamd in the foreground (-D), painstakingly read all the debug info
for a couple messages, and nothing looked bad.  When messages DID scan, they
took no more than a second or two, so I don't think there are DNS issues, but
I don't know where else to look.  Things just seem to stop processing
suddenly; I don't get it.

  Anyone have hints?




Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Kai Schaetzl wrote on Mon, 11 Jul 2005 22:31:29 +0200:

> With the default of 5 we get almost none, not even one per day.

That was about FPs. Wrong. We don't get *any* FPs. We do not get even one 
*FN* per day.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org





Re: Performance: files or SQL?

2005-07-11 Thread Michael Parker
Cami wrote:

> SQL simply doesn't scale very well for bayes. We have a server farm of
> 12 spamassassin servers and are storing bayes in SQL. We see on average
> about 4000 queries per second.  The MySQL server has been optimized
> to hell and back and is running on high-end hardware, but it simply
> doesn't scale as more and more mail begins to roll in.
>

I'd be interested in your setup (MySQL and SA).  I have no problems
getting 4200+ queries per second on a single processor machine, with
spamd running on the same box.  It's not even sweating that hard.  When
you are at peak what is your average scantime per msg?

Michael




Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kris Deugau
jdow wrote:
> A few weeks ago I'd have said "Easy, Ducky!" Then I ran into Dovecot
> that uses an indexed almost "mbox" file. There is no way to do it
> other than "good guess". However, for a traditional UNIX mbox file
> you should be able to nail it perfectly simply looking for the "From"
> feature. The dirt stupid "mail" utility looks for a blank line
> followed by a line that starts with "From". All other lines that
> start with From are supposed to be escaped to ensure accurate
> detection. Dovecot skips this blank line feature sometimes. "mail"
> does not like this. I have not yet seen any indication that SA is
> upset with this, however.

Just to be pedantic, it's actually (IIRC) a double newline followed by
"From " (note the space!  It's important.)  Many mail-handling apps will
actually parse the From-space "header" in more detail, "just in case".

grep "^From " |wc -l typically gives an accurate count; 
procmail at least is bright enough to escape message body lines such
that they don't break this.
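
The stricter rule (a "From " line only counts at the start of the file or right after a blank line) can be sketched in Python as follows. This is an illustration of the rule as described here, not what any particular mail tool actually does:

```python
def count_strict(lines):
    """Count 'From ' lines that open the file or follow a blank line."""
    count = 0
    previous_blank = True  # the very first line has no predecessor
    for line in lines:
        if previous_blank and line.startswith("From "):
            count += 1
        previous_blank = line.strip("\r\n") == ""
    return count
```

A body line starting with "From " in the middle of a paragraph is skipped by this rule, while a plain grep would count it.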

-kgd
-- 
Get your mouse off of there!  You don't know where that email has been!


Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Loren Wilton wrote on Mon, 11 Jul 2005 11:30:07 -0700:

> Which of course means that by picking the ratio value you can pick pretty 
> much any fp/fn ratio you want.

Only if the distribution was equal.

Kai






Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:

> We are very glad and happy about this concept and implementation.

Well, the big question is: How many of your spam messages score between 
the default 5 and your "floating score"? If it is many there's obviously 
something wrong with your setup: your spam is not scoring high enough. 
Additionally, it means that your Bayes auto-learn will feed less spam to 
learn than it could because your overall spam score is way too low. Our 
average ham score is indeed around -2 as yours is. And it's a very high 
peak: there are more -2 mails than all other ham mails combined. However, our 
spam score peak is *way* higher than yours is: it "flattens" over 18 and 
30, so the average is somewhere around 25 or so. (I deduced that from 
looking at the raw figures not by calculating a median or average.) I 
consider your average spam score of 6 as *extremely* bad from a detection 
standpoint.
With a score of 0.5 I would get a *considerable* amount of ham scored as 
spam. With the default of 5 we get almost none, not even one per day. I 
doubt that your rate of FPs is nearly nonexistent with a spam threshold 
of 0.5. There *must* be a considerable rate of FPs, you just don't hear 
about it.

I think the general approach on this list is to make spam score as spammy 
as possible. That's what we do as well. Instead of driving spam to the sky 
you are trying to find some non-existing "barrier" which may indeed float 
because tomorrow's messages score differently than yesterday's. It does not 
float at all in the long run. And it exists *only* in the long run. It may 
throw off next day's detection quite heavily, since there's no guarantee 
spam and ham look the same next day or even float around that point. It's 
not even a statistical figure, you deliberately set it to 30%, probably 
because you get too much spam if you set it higher. That's bad, really bad 
detection ... 
If much of your spam is lower than 5 then the spam detection rate of your 
SA is quite bad. You should improve that instead of trying to find a 
barrier which gives you the best FP:FN ratio. It may indeed give you the 
best ratio with your bad setup but not the lowest FP rate and probably not 
the best ratio compared with a setup that drives spam to the sky.
I see your approach as an interesting way of optimizing the threshold when 
you don't get optimal scores. But you would be better off to optimize the 
scores.

BTW: what does "normalized" exactly mean in this context?

Kai






Re: SA 2.63 vs 2.64

2005-07-11 Thread Matthias Fuhrmann
On Sun, 10 Jul 2005, Matthias Fuhrmann wrote:

[...]
>   # jm: do not...
>
> the lines from Bayes.pm fit the error messages. didn't check
> PerMsgStatus.pm, but i guess it's the same issue.
> can someone explain the difference or the impact to the problem, described
> above?
>
> what about replacing the line of 2.64 with the old working one from 2.63?
> hope i'm not too wrong, since i try debugging for some hours now :)

just in case someone starts bothering. i've upgraded to 3.0.4 and
surprisingly there were only some rules to fix and bayesdb, which we had
to convert.
best of all, the error messages from 2.64 are gone and syslog outputs are
now a lot more verbose, very nice :)

regards,
Matthias


Re: Performance: files or SQL?

2005-07-11 Thread Cami

Mike Jackson wrote:
> On my personal server, I'm running SA 3.0.4 with the user prefs, Bayes,
> and AWL in a MySQL database (mostly because it would be "cooler" that
> way). On my employer's server, I'm running the same SA version, but with
> file-based DBs and user prefs. We're going to be rolling out doing
> filtering for all our mailboxes (several hundred) as opposed to opt-in
> (as we're doing it now on about 20 accounts). I know I could do
> benchmarks myself, but I wanted to get your impressions if there's a
> performance improvement using SQL for storage (user prefs, Bayes, AWL)
> rather than files. Thanks.


SQL simply doesn't scale very well for bayes. We have a server farm of
12 spamassassin servers and are storing bayes in SQL. We see on average
about 4000 queries per second.  The MySQL server has been optimized
to hell and back and is running on high-end hardware, but it simply
doesn't scale as more and more mail begins to roll in.


Cami


Re: procmail: Could not create INET socket on 127.0.0.1:783: Permission denied

2005-07-11 Thread jdow
From: <[EMAIL PROTECTED]>

> Hello,
> 
> I set up spamassassin to work with procmail according to instructions.
> Here is what is in ~/.procmailrc:
> 
> #SPAM ASSASSIN SECTION
> 
> :0fw: spamd.lock
> * < 256000
> | /usr/sbin/spamd
  ^ The spamd tool is run as a daemon. You want spamc
here. Start spamd in Mandrake, RedHat, and I believe SUSE with the
"chkconfig spamassassin on;service spamassassin start" mantra if a
"service spamassassin status" does not report it is running.

> :0:
> * ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
> almost-certainly-spam
> 
> :0:
> * ^X-Spam-Status: Yes
> probably-spam
> 
> :0
> * ^^rom[ ]
> {
>   LOG="*** Dropped F off From_ header! Fixing up. "
> 
>   :0 fhw
>   | sed -e '1s/^/F/'
> }
> 
> #===END SPAM ASSASSIN SECTION==

{^_^}



procmail: Could not create INET socket on 127.0.0.1:783: Permission denied

2005-07-11 Thread prosolutions
Hello,

I set up spamassassin to work with procmail according to instructions.
Here is what is in ~/.procmailrc:

#SPAM ASSASSIN SECTION

:0fw: spamd.lock
* < 256000
| /usr/sbin/spamd

:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
almost-certainly-spam

:0:
* ^X-Spam-Status: Yes
probably-spam

:0
* ^^rom[ ]
{
  LOG="*** Dropped F off From_ header! Fixing up. "

  :0 fhw
  | sed -e '1s/^/F/'
}

#===END SPAM ASSASSIN SECTION==


However spamd is failing to run.  This is what I see in the procmail
log:

procmail: Unlocking "/home/user/.lockmail"
procmail: [19772] Mon Jul 11 11:18:53 2005
procmail: Match on "< 256000"
procmail: Locking "spamd.lock"
procmail: Executing "/usr/sbin/spamd"
Could not create INET socket on 127.0.0.1:783: Permission denied
(IO::Socket::INET: Permission denied)
procmail: [19772] Mon Jul 11 11:18:54 2005
procmail: Program failure (13) of "/usr/sbin/spamd"







Performance: files or SQL?

2005-07-11 Thread Mike Jackson
On my personal server, I'm running SA 3.0.4 with the user prefs, Bayes, and 
AWL in a MySQL database (mostly because it would be "cooler" that way). On 
my employer's server, I'm running the same SA version, but with file-based 
DBs and user prefs. We're going to be rolling out doing filtering for all 
our mailboxes (several hundred) as opposed to opt-in (as we're doing it now 
on about 20 accounts). I know I could do benchmarks myself, but I wanted to 
get your impressions if there's a performance improvement using SQL for 
storage (user prefs, Bayes, AWL) rather than files. Thanks.


Mike Jackson
Tech Administrator, Datahost
www.datahost.com



Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
A few weeks ago I'd have said "Easy, Ducky!" Then I ran into Dovecot
that uses an indexed almost "mbox" file. There is no way to do it other
than "good guess". However, for a traditional UNIX mbox file you should
be able to nail it perfectly simply looking for the "From" feature. The
dirt stupid "mail" utility looks for a blank line followed by a line
that starts with "From". All other lines that start with From are supposed
to be escaped to ensure accurate detection. Dovecot skips this blank line
feature sometimes. "mail" does not like this. I have not yet seen any
indication that SA is upset with this, however.

{^_^}
- Original Message - 
From: "Joe Flowers" <[EMAIL PROTECTED]>

> Matt:
>
> I know you know a lot more about this than I do, but for what it's
> worth, your impressions/intuitions are very close to mine.
> Originally back in April, I started off using the "average of the
> means", but that let through way too much spam.
>
> So, what I have now is it set to 30% above the average ham score, which
> is 20% below the "average of the means".
> The assumption being that the optimal spot is somewhere between the two
> averages.
>
> Also, that nasty drop off that produces a lot of FPs is in my intuition
> too and as of yet, we haven't run into it.
>
> Now, if the two curves could be slid apart wider so that there is a big
> deadzone,... Although, without upgrading to a newer version of SA, I
> don't see how I can expect much better results.
>
> BTW, if anyone knows a command line program that can easily run through a
> bunch of mbox files and tell how many messages are in them, I will
> report back how many ham and how many spam messages that I have fed to
> bayes. It's far from perfect, but it may offer some interesting info
> regarding the 100:1 (fn:fp) ratio.
>
> Joe
>
>
> Matt Kettler wrote:
>
> >Joe Flowers wrote:
> >
> >
> >>Matt Kettler wrote:
> >>
> >>
> >>
> >>>The only problem I see with this approach is that it treats false
> >>>positives and
> >>>false negatives as being equally bad.
> >>>
> >>>
> >>>
> >>>
> >>We do get many more false negatives than false positives, even though we
> >>don't get false positives very often - they are rare.
> >>We certainly don't get 1 fp for every fn.
> >>
> >>
> >>
> >>>In general, you're adjusting the score bias so the number of FP's and
> >>>FNs are
> >>>approximately equal.
> >>>
> >>>
> >>This is not what we are seeing in practice. It's not even close to 50-50.
> >>
> >
> >Based on JM's comments about the score distribution for hams being
> >non-linear, this makes sense. If the distribution was linear for both
> >you'd get 50/50 by dividing the score between the two means.
> >
> >Since the ham is going to have a pretty sharp drop-off somewhere slightly
> >above its mean your split score approach won't be as bad as 1:1, but it's
> >also likely to not be as good as 100:1 which the 5.0 threshold should get
> >you.
> >
> >It's an interesting concept, and it would be very interesting to graph
> >out FP vs FN rates against thresholds.
> >
> >This graph from JM's post is real data:
> >http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
> >
> >But it doesn't go below 5.0. It would be interesting to see how those
> >curves continue as you approach 0.
> >
> >This graph is a good conceptual one in the "normal" sense of numbers:
> >http://taint.org/xfer/2005/score-dist-doodle.gif
> >
> >That graph would suggest that somewhere below 5.0 there is a threshold at
> >which the ham FP rate gets MUCH worse in a very sudden way. However,
> >there's no score associated. I'd venture to guess that your "average of
> >the means" is going to wind up picking something near, but just above
> >that threshold.
> >
> >That's a bit of an intuitive guess, but also it has some roots in
> >reality. The average score of a ham message on a curve like that is going
> >to wind up being somewhere in the middle of that nasty drop off. By
> >biasing just above that you should bring yourself into the second part of
> >the curve, where decreases in score have a somewhat modest impact on FP
> >rate.
> >
> >
> >
> >
>




RE: SURBL, SA 3.0.4, and firewalls

2005-07-11 Thread Stewart, John

> All it needs is port 53 TCP and UDP open (outbound). How you open it
> depends on which firewall product you use; a bit of Googling for the
> product and the ports will yield what you need.

One thing to note... if your firewall is proxying for you, make sure it
doesn't think it's authoritative for the 127.0.0.X stuff. Ours did and when
it got a reply back from the SURBL servers with a result of 127.0.0.10, for
example, the firewall actually returned NXDOMAIN because it saw that the
results were in a domain it was authoritative for, and discarded them as
invalid.
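
To see why that hurts: a hit on a DNS blocklist comes back as an address in 127.0.0.0/8, and "no such domain" means "not listed", so a proxy that turns those answers into NXDOMAIN makes every lookup look clean. A small Python sketch of how a client might classify an answer (the function name and return values are made up for illustration, and this says nothing about what the individual values mean on any given list):

```python
import ipaddress

LOOPBACK = ipaddress.ip_network("127.0.0.0/8")


def classify_answer(answer):
    """Classify one DNS blocklist lookup result.

    answer is the A record as a string, or None for NXDOMAIN.
    Returns ("not listed", None) or ("listed", last_octet).
    """
    if answer is None:
        return ("not listed", None)
    address = ipaddress.ip_address(answer)
    if address not in LOOPBACK:
        # A blocklist answers inside 127.0.0.0/8; anything else is suspect.
        raise ValueError("unexpected blocklist answer: %s" % answer)
    return ("listed", int(address) & 0xFF)
```

With the firewall behaviour described above, the 127.0.0.10 reply never reaches the client, so it only ever sees the "not listed" case.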

johnS


Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers

jdow wrote:

> The greater the separation, the
> better the results for a decision point between them.

> But anything you can do that widens the
> typical score distribution between ham and spam is a good thing.

Amen




Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
> There's another thing worth noting -- the SpamAssassin score distribution
> for hams and spams isn't even.

I don't see that those particular curve shapes in any way invalidate
this method, although they do bias the method somewhat.
The two curves are essentially smooth curves with no major dips or bumps in
them, so it is possible to select a ratio without getting inversions in the
ratio as the selector moves from left to right.  You may have to be careful
of calculating the ratio, given that ham goes to effectively zero above a
certain value.  But n:0 and 3.45n:0 are still perfectly valid ratios to deal
with, even if one of the terms is zero.
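
As a rough illustration of moving the selector from left to right, here is a Python sketch that counts FPs and FNs at a few thresholds. The scores are made up for the example and the function names are hypothetical; it assumes a message is tagged as spam when its score is at or above the threshold:

```python
def error_counts(ham_scores, spam_scores, threshold):
    """Return (false_positives, false_negatives) at the given threshold.

    A message is tagged as spam when its score is >= threshold.
    """
    false_positives = sum(1 for score in ham_scores if score >= threshold)
    false_negatives = sum(1 for score in spam_scores if score < threshold)
    return false_positives, false_negatives


def sweep(ham_scores, spam_scores, thresholds):
    """Return [(threshold, fp, fn), ...] for each candidate threshold."""
    return [(t,) + error_counts(ham_scores, spam_scores, t) for t in thresholds]


# Made-up scores, for illustration only.
ham = [-2.5, -2.0, -1.0, 0.3, 1.2, 4.1]
spam = [0.8, 3.5, 6.0, 9.5, 18.0, 25.0]
table = sweep(ham, spam, [0.5, 2.0, 5.0])
```

Keeping FP and FN as separate counts, rather than dividing one by the other, sidesteps the zero term: a row with no FPs is still a perfectly usable data point.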

Loren



Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers

Matt:

I know you know a lot more about this than I do, but for what it's
worth, your impressions/intuitions are very close to mine.
Originally back in April, I started off using the "average of the
means", but that let through way too much spam.


So, what I have now is it set to 30% above the average ham score, which
is 20% below the "average of the means".
The assumption being that the optimal spot is somewhere between the two
averages.


Also, that nasty drop off that produces a lot of FPs is in my intuition
too and as of yet, we haven't run into it.


Now, if the two curves could be slid apart wider so that there is a big
deadzone,... Although, without upgrading to a newer version of SA, I
don't see how I can expect much better results.


BTW, if anyone knows a command line program that can easily run through a
bunch of mbox files and tell how many messages are in them, I will
report back how many ham and how many spam messages that I have fed to
bayes. It's far from perfect, but it may offer some interesting info
regarding the 100:1 (fn:fp) ratio.


Joe


Matt Kettler wrote:


> Joe Flowers wrote:
>
>> Matt Kettler wrote:
>>
>>> The only problem I see with this approach is that it treats false
>>> positives and false negatives as being equally bad.
>>
>> We do get many more false negatives than false positives, even though we
>> don't get false positives very often - they are rare.
>> We certainly don't get 1 fp for every fn.
>>
>>> In general, you're adjusting the score bias so the number of FP's and
>>> FNs are approximately equal.
>>
>> This is not what we are seeing in practice. It's not even close to 50-50.
>
> Based on JM's comments about the score distribution for hams being non-linear,
> this makes sense. If the distribution was linear for both you'd get 50/50 by
> dividing the score between the two means.
>
> Since the ham is going to have a pretty sharp drop-off somewhere slightly above
> its mean your split score approach won't be as bad as 1:1, but it's also likely
> to not be as good as 100:1 which the 5.0 threshold should get you.
>
> It's an interesting concept, and it would be very interesting to graph out FP vs
> FN rates against thresholds.
>
> This graph from JM's post is real data:
> http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
>
> But it doesn't go below 5.0. It would be interesting to see how those curves
> continue as you approach 0.
>
> This graph is a good conceptual one in the "normal" sense of numbers:
> http://taint.org/xfer/2005/score-dist-doodle.gif
>
> That graph would suggest that somewhere below 5.0 there is a threshold at which
> the ham FP rate gets MUCH worse in a very sudden way. However, there's no score
> associated. I'd venture to guess that your "average of the means" is going to
> wind up picking something near, but just above that threshold.
>
> That's a bit of an intuitive guess, but also it has some roots in reality. The
> average score of a ham message on a curve like that is going to wind up being
> somewhere in the middle of that nasty drop off. By biasing just above that you
> should bring yourself into the second part of the curve, where decreases in
> score have a somewhat modest impact on FP rate.





Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
> > score of -2.1532284.  I have the dividing line "set" at 30% of the
> > distance between the average ham score and average spam score (30% above
> > the average ham score). So, the dividing line is currently floating
> > around 0.55416414.
>
>
> The only problem I see with this approach is that it treats false
> positives and false negatives as being equally bad.

Matt, isn't he actually treating an FP as ~2x as bad as an FN?  He has the
divider set to 30%, so is biased in one direction or the other.

Which of course means that by picking the ratio value you can pick pretty
much any fp/fn ratio you want.
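
For reference, the arithmetic behind the floating line can be checked with the averages Joe reported (6.8714134 for spam, -2.1532284 for ham). A short Python sketch; the function name is just for illustration:

```python
def dividing_line(ham_mean, spam_mean, fraction):
    """Point lying `fraction` of the way from ham_mean up to spam_mean."""
    return ham_mean + fraction * (spam_mean - ham_mean)


ham_mean = -2.1532284
spam_mean = 6.8714134

floating = dividing_line(ham_mean, spam_mean, 0.30)  # about 0.55416414
midpoint = dividing_line(ham_mean, spam_mean, 0.50)  # the "average of the means"
```

The 0.30 is the knob being discussed here: 0.50 gives the plain average of the two means, and smaller values slide the line toward the ham average.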

Loren



Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
From: "Matt Kettler" <[EMAIL PROTECTED]>

> Joe Flowers wrote:
> > I don't know if this will help anyone or not, but I wanted to report
> > back just in case.
> >
> > In early April, I completely unhinged the dividing line between what SA
> > score is used to mark a message as spam or ham (5.00 = default). This
> > allows the system and this dividing line to drift "freely" to anywhere
> > that SA will allow, without bound. This anti-spam setup has worked
> > consistently much much better the whole time than in any previous
> > implementation that we have done and with very little maintenance. We
> > are very happy with it and are looking forward to implementing future SA
> > versions in the same fashion.
> >
> > I'm not exactly sure the following numbers represent the whole time
> > since April, but they should be pretty close.
> >
> > We've had 360,922 spam messages and 396,983 ham messages with a
> > normalized average spam score of 6.8714134 and a normalized average ham
> > score of -2.1532284.  I have the dividing line "set" at 30% of the
> > distance between the average ham score and average spam score (30% above
> > the average ham score). So, the dividing line is currently floating
> > around 0.55416414.
>
>
> The only problem I see with this approach is that it treats false
> positives and false negatives as being equally bad.
>
> In general, you're adjusting the score bias so the number of FP's and FNs
> are approximately equal. Although STATISTICS*.txt would suggest that this
> boundary occurs somewhere near 2.0, your own local biases could change this
> considerably.
>
>
> SA's normal scoreset is evolved with the concept that it's better to have
> 99 false negatives than 1 false positive. The concept here is most people
> use scripts to move their spam into a separate folder, or auto delete it.
> With that going on, a FP is potentially lost valid email, whereas a FN is a
> minor inconvenience.

Operating experience here seems to indicate that the SA score evolution
is not optimum. What you want to do is create a brassiere curve
for the markups for ham and spam. The greater the separation, the
better the results for a decision point between them. The bias to
prevent false positives probably means you do not want the decision
point right in the center. But anything you can do that widens the
typical score distribution between ham and spam is a good thing. It makes
the decision point less sensitive to where it is set and the overall error rates
lower. I think this is part of the reason I have so much success on a
box vastly overloaded with SARE and other rules. The good rules pile
one on the other until it's VERY clear what is ham and what is spam.

(It surely would be nice if there were some really good indications of
"not spam". However, nothing has ever appeared other than absence of
hits on spam-sign.)

{^_^}




RE: sa-learn on a wide site HOWTO ?

2005-07-11 Thread Aaron Grewell

> Forget about this. Most of your users will only report spams,
> not ham, they're going to screw the bayes database. As a
> consequence, you'll have more spam, or more fp.
>
> You should find another solution or educate your users (but
> it takes too much time) so they feed the Bayesian filters correctly.
> 

I've heard this many times, but my experience thus far hasn't borne it out.
We've got SA w/Bayes running site-wide on our 400-user system and Bayes_99
is consistently our highest-scoring test systemwide, even outscoring the
various SBL and URIBL tests.  That said, the Ham corpus is entirely my own,
I don't bother to have my users submit anything but Spam.  This works
surprisingly well, so I guess I have good Ham. :)

My method is simple and fairly manual.  I have my users put Spam in an
Exchange Public Folder (substitute shared IMAP folder if you're using a more
standard e-mail server) and copy them down into a local MBOX.  Thunderbird
is handy for this.  I upload the MBOX file to the SA server, run sa-learn,
and it's done.  Initially I had to do this fairly often, but once I had it
well trained and enough SARE rules in place it became less of an issue.  I
now run it only every other month or so.  Bayes covers a number of
corner-cases that aren't covered by rules, so it's an important part of my
overall strategy.  It's also handy to train in new spam that hasn't hit the
URIBLs or other rules yet, much easier than writing custom rules.

Bayes hasn't given any false positives that I'm aware of in the last year,
despite the theoretical skew that ought to be introduced by using everyone's
Spam and only my Ham.  I cannot tell you why, but it works and it works
well.

Aaron Grewell
Network Administrator
University of Washington Bothell


Re: simultaneous sa-learn processes

2005-07-11 Thread jdow
From: "Chavdar Videff" <[EMAIL PROTECTED]>

> On Monday 11 July 2005 14:50, JamesDR wrote:
> > Chavdar Videff wrote:
> > > Hi List,
> > >
> > > Our mailserver server serves about 100 users. Our config:
> > > Sendmail+Procmail+SpamAssassin.
> > > The question is:
> > > If I got it right, we should run sa-learn for each user in order to
> > > benefit from bayes. We intend to run a cron job for each user and do it
> > > at night by supplying a daily snapshot of our spam and ham collections to
> > > sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)?
> > > A weekly collection run for 1 user usually eats 100% of CPU load. My
> > > concern is whether the system is going to crash or just do the job slower
> > > and if you can point out how many sa-learn tasks could we run
> > > simultaneously with our setup.
> > > All hints will be appreciated, for we scheduled an initial load for 16
> > > users of the big collection of spam received so far.
> > >
> > > Thanks guys
> > >
> > > Chavdar Videff
> >
> > What kind of Bayes db are you using? We use MySQL here and haven't seen
> > SA-Learn use up that much cpu... I've run it manually up to 10 processes
> > at once without any noticeable slowing of the machine. (p2 450mhz, 256mb)
>
> I guess it is BerkeleyDB, the default installation on Debian. The interesting
> part is that while testing cron on one user the cpu fall was not noticeable.

If you are feeding individual per-user Bayes databases, feed each one with
the ham samples and spam samples submitted by that particular user for HER
Bayes. If you have them all working off the same Bayes corpus then there is
little or no gain to using per-user Bayes.

{^_^}




Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


the real-world figures can be seen for various thresholds in
the rules/STATISTICS*.txt files...

- --j.

Matt Kettler writes:
> Joe Flowers wrote:
> > Matt Kettler wrote:
> > 
> >> The only problem I see with this approach is that it treats false
> >> positives and
> >> false negatives as being equally bad.
> >>  
> >>
> > 
> > We do get many more false negatives than false positives, even though we
> > don't get false positives very often - they are rare.
> > We certainly don't get 1 fp for every fn.
> > 
> >> In general, you're adjusting the score bias so the number of FP's and
> >> FNs are
> >> approximately equal.
> > 
> > 
> > This is not what we are seeing in practice. It's not even close to 50-50.
> > 
> 
> Based on JM's comments about the score distribution for hams being non-linear,
> this makes sense. If the distribution was linear for both you'd get 50/50 by
> dividing the score between the two means.
> 
> Since the ham is going to have a pretty sharp drop-off somewhere slightly above
> its mean, your split score approach won't be as bad as 1:1, but it's also likely
> to not be as good as 100:1 which the 5.0 threshold should get you.
> 
> It's an interesting concept, and it would be very interesting to graph out FP vs
> FN rates against thresholds.
> 
> This graph from JM's post is real data:
> http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
> 
> But it doesn't go below 5.0. It would be interesting to see how those curves
> continue as you approach 0.
> 
> This graph is a good conceptual one in the "normal" sense of numbers:
> http://taint.org/xfer/2005/score-dist-doodle.gif
> 
> That graph would suggest that somewhere below 5.0 there is a threshold at which
> the ham FP rate gets MUCH worse in a very sudden way. However, there's no score
> associated. I'd venture to guess that your "average of the means" is going to
> wind up picking something near, but just above that threshold.
> 
> That's a bit of an intuitive guess, but also it has some roots in reality. The
> average score of a ham message on a curve like that is going to wind up being
> somewhere in the middle of that nasty drop off. By biasing just above that you
> should bring yourself into the second part of the curve, where decreases in
> score have a somewhat modest impact on FP rate.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFC0q8dMJF5cimLx9ARAuLrAKCQnoc8eo2rAvIDYIWX0DfW/T0NZgCePoyH
WZS8C6aamuWZ3H6C6n8k2n0=
=Hruw
-END PGP SIGNATURE-



Re: How can I correctly detect these spams?

2005-07-11 Thread Kai Schaetzl
I repeat myself ;-)

> It seems you are not using *any* custom rules. You may want to check out 
> RDJ and SARE.



Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org





Re: simultaneous sa-learn processes

2005-07-11 Thread Kai Schaetzl
Chavdar Videff wrote on Mon, 11 Jul 2005 16:13:44 +0300:

> If there is a way to set up a single bayes database I would prefer that

There is one, just look in the SA documentation. (documentation for 
local.cf should do.)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org





Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote:
> Matt Kettler wrote:
> 
>> The only problem I see with this approach is that it treats false
>> positives and
>> false negatives as being equally bad.
>>  
>>
> 
> We do get many more false negatives than false positives, even though we
> don't get false positives very often - they are rare.
> We certainly don't get 1 fp for every fn.
> 
>> In general, you're adjusting the score bias so the number of FP's and
>> FNs are
>> approximately equal.
> 
> 
> This is not what we are seeing in practice. It's not even close to 50-50.
> 

Based on JM's comments about the score distribution for hams being non-linear,
this makes sense. If the distribution was linear for both you'd get 50/50 by
dividing the score between the two means.

Since the ham is going to have a pretty sharp drop-off somewhere slightly above
its mean, your split score approach won't be as bad as 1:1, but it's also likely
to not be as good as 100:1 which the 5.0 threshold should get you.

It's an interesting concept, and it would be very interesting to graph out FP vs
FN rates against thresholds.

This graph from JM's post is real data:
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html

But it doesn't go below 5.0. It would be interesting to see how those curves
continue as you approach 0.

This graph is a good conceptual one in the "normal" sense of numbers:
http://taint.org/xfer/2005/score-dist-doodle.gif

That graph would suggest that somewhere below 5.0 there is a threshold at which
the ham FP rate gets MUCH worse in a very sudden way. However, there's no score
associated. I'd venture to guess that your "average of the means" is going to
wind up picking something near, but just above that threshold.

That's a bit of an intuitive guess, but also it has some roots in reality. The
average score of a ham message on a curve like that is going to wind up being
somewhere in the middle of that nasty drop off. By biasing just above that you
should bring yourself into the second part of the curve, where decreases in
score have a somewhat modest impact on FP rate.
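
The FP-versus-FN comparison across thresholds suggested above could be tabulated with something like the following Python sketch. The score lists are made up purely for illustration, the function name `error_rates` is hypothetical, and the only behaviour assumed is that a message counts as spam when its score is at or above the threshold.

```python
def error_rates(ham_scores, spam_scores, threshold):
    """Return (fp_rate, fn_rate) for a given spam threshold.

    A message is treated as spam when its score is >= threshold, so a
    false positive is a ham at or above it and a false negative is a
    spam below it.
    """
    fp = sum(1 for s in ham_scores if s >= threshold)
    fn = sum(1 for s in spam_scores if s < threshold)
    return fp / len(ham_scores), fn / len(spam_scores)


# Made-up scores, for illustration only (not real corpus data).
ham = [-2.6, -1.9, -0.4, 0.3, 1.1, 2.4]
spam = [1.8, 4.2, 6.5, 9.9, 14.0, 21.3]

for threshold in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0):
    fp_rate, fn_rate = error_rates(ham, spam, threshold)
    print(f"{threshold:4.1f}  FP {fp_rate:.2f}  FN {fn_rate:.2f}")
```

Run against a real hand-sorted corpus, the same loop would show where the FP rate starts to climb as the threshold drops toward 0.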


Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers

Thanks Jason!

That's good, new info for me. That'll help me *at the very least* 
visualize what I am trying to do a little better. I've been very curious 
to know what the rough shapes of those graphs look like.


Joe



Justin Mason wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> There's another thing worth noting -- the SpamAssassin score distribution
> for hams and spams isn't even.
>
> If you draw a graph of hams and spams, plotting the number of mails in
> each category as the vertical axis and the score they get as the
> horizontal axis, you don't get a simple pair of intersecting straight
> lines.
>
> Instead, since we have many more mark-as-spam rules than mark-as-ham,
> and due to how the perceptron attempts to optimise for the 5.0
> threshold, what happens is that you have two different lines.
>
> The ham line is a sigmoid curve, that starts high in the negative area,
> and curves down to almost 0 at the 5.0 threshold mark.  The spam line, by
> contrast, is a straight line.
> http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate
> this, or take a look at
> http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
> for real-world graphs of this data from 2002 -- although graphing
> the inverse.
>
> Very interesting approach though!
>
> - --j.






Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


There's another thing worth noting -- the SpamAssassin score distribution
for hams and spams isn't even.

If you draw a graph of hams and spams, plotting the number of mails in
each category as the vertical axis and the score they get as the
horizontal axis, you don't get a simple pair of intersecting straight
lines.

Instead, since we have many more mark-as-spam rules than mark-as-ham,
and due to how the perceptron attempts to optimise for the 5.0
threshold, what happens is that you have two different lines.

The ham line is a sigmoid curve, that starts high in the negative area,
and curves down to almost 0 at the 5.0 threshold mark.  The spam line, by
contrast, is a straight line.
http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate
this, or take a look at
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
for real-world graphs of this data from 2002 -- although graphing
the inverse.

Very interesting approach though!

- --j.

Joe Flowers writes:
> Matt Kettler wrote:
> 
> >The only problem I see with this approach is that it treats false positives 
> >and
> >false negatives as being equally bad.
> >  
> >
> 
> We do get many more false negatives than false positives, even though we 
> don't get false positives very often - they are rare.
> We certainly don't get 1 fp for every fn.
> 
> >In general, you're adjusting the score bias so the number of FP's and FNs are
> >approximately equal. 
> >
> 
> This is not what we are seeing in practice. It's not even close to 50-50.
> 
> >Although STATISTICS*.txt would suggest that this boundary
> >occurs somewhere near 2.0, your own local biases could change this 
> >considerably.
> >
> >
> >SA's normal scoreset is evolved with the concept that it's better to have 99
> >false negatives than 1 false positive. 
> >
> 
> We are very glad and happy about this concept and implementation.
> 
> >The concept here is most people use
> >scripts to move their spam into a separate folder, or auto delete it. With 
> >that
> >going on, a FP is potentially lost valid email, whereas a FN is a minor
> >inconvenience.
> >  
> >
> 
> Yes We work hard to inform our users and to actively solicit their 
> feedback on how the system is working and to lookout for the system 
> misplacing emails, especially valid ones. I know it's still not perfect
> 
> >For any site that considers FPs to be "not too bad" because all mail is 
> >manually
> >examined anyway, lowering the score threshold may be a workable thing.
> >
> >However, other sites that auto-delete such messages may have considerable
> >problems if they lower the threshold
> >  
> >
> 
> YES!
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFC0qYfMJF5cimLx9ARAp+YAJ0X7eoijcnMOE+3WkOlfQQEzasjwgCfZp9B
TdyM6BfLga48fgif1AzBW7U=
=qdan
-END PGP SIGNATURE-



Re: sa-learn on a wide site HOWTO ?

2005-07-11 Thread Julien Reveret
On 16:56, Mon 11 Jul 05, Karl.Oulmi wrote:
> Hi,
> 
> I always have a box with postfix/amavis and SpamAssassin running.
> Now, I'd like to run sa-learn so that my users (~500) can teach Spam & Ham 
> to SpamAssassin.
> 
> The idea is the following.
> On every mail passed through my mailserver, a header or a footer is 
> added to the mail with a mailto link that lets my users teach 
> SpamAssassin whether the mail is spam or not.
 
Forget about this. Most of your users will only report spam,
not ham, so they're going to screw up the Bayes database. As a
consequence, you'll have more spam, or more FPs.

You should find another solution or educate your users (but
it takes too much time) so they feed the Bayesian filters
correctly.


Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers

Matt Kettler wrote:

> The only problem I see with this approach is that it treats false positives and
> false negatives as being equally bad.

We do get many more false negatives than false positives, even though we 
don't get false positives very often - they are rare.

We certainly don't get 1 fp for every fn.

> In general, you're adjusting the score bias so the number of FP's and FNs are
> approximately equal. 

This is not what we are seeing in practice. It's not even close to 50-50.

> Although STATISTICS*.txt would suggest that this boundary
> occurs somewhere near 2.0, your own local biases could change this considerably.
>
> SA's normal scoreset is evolved with the concept that it's better to have 99
> false negatives than 1 false positive. 

We are very glad and happy about this concept and implementation.

> The concept here is most people use
> scripts to move their spam into a separate folder, or auto delete it. With that
> going on, a FP is potentially lost valid email, whereas a FN is a minor
> inconvenience.

Yes. We work hard to inform our users and to actively solicit their 
feedback on how the system is working and to look out for the system 
misplacing emails, especially valid ones. I know it's still not perfect.

> For any site that considers FPs to be "not too bad" because all mail is manually
> examined anyway, lowering the score threshold may be a workable thing.
>
> However, other sites that auto-delete such messages may have considerable
> problems if they lower the threshold.

YES!





RE: spamassassin with GORDANO

2005-07-11 Thread Bret Miller
> Does anyone know If I can use Spammain with GMS (Gordano
> Mail Software for Linux)

In theory, you could use MailScanner as a proxy in front of GMS to run
SpamAssassin before the message gets to GMS.

And, if I recall correctly (I haven't used GMS for several years), I
think you can use their MML scripting language to run an application, so
you should be able to run SpamAssassin from there and replace the
original message with the tagged version if your MML scripting skills
are adequate for doing that.

Bret





Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote:
> I don't know if this will help anyone or not, but I wanted to report
> back just in case.
> 
> In early April, I completely unhinged the dividing line between what SA
> score is used to mark a message as spam or ham (5.00 = default). This
> allows the system and this dividing line to drift "freely" to anywhere
> that SA will allow, without bound. This anti-spam setup has worked
> consistently much much better the whole time than in any previous
> implementation that we have done and with very little maintenance. We
> are very happy with it and are looking forward to implementing future SA
> versions in the same fashion.
> 
> I'm not exactly sure the following numbers represent the whole time
> since April, but they should be pretty close.
> 
> We've had 360,922 spam messages and 396,983 ham messages with a
> normalized average spam score of 6.8714134 and a normalized average ham
> score of -2.1532284.  I have the dividing line "set" at 30% of the
> distance between the average ham score and average spam score (30% above
> the average ham score). So, the dividing line is currently floating
> around 0.55416414.


The only problem I see with this approach is that it treats false positives and
false negatives as being equally bad.

In general, you're adjusting the score bias so the number of FP's and FNs are
approximately equal. Although STATISTICS*.txt would suggest that this boundary
occurs somewhere near 2.0, your own local biases could change this considerably.


SA's normal scoreset is evolved with the concept that it's better to have 99
false negatives than 1 false positive. The concept here is most people use
scripts to move their spam into a separate folder, or auto delete it. With that
going on, a FP is potentially lost valid email, whereas a FN is a minor
inconvenience.

For any site that considers FPs to be "not too bad" because all mail is manually
examined anyway, lowering the score threshold may be a workable thing.

However, other sites that auto-delete such messages may have considerable
problems if they lower the threshold.
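
For reference, the floating dividing line quoted at the top of this message is simple arithmetic: start at the average ham score and move 30% of the way toward the average spam score. The Python sketch below reproduces it with the averages from the quoted post; the function name `floating_threshold` is hypothetical and not part of SpamAssassin.

```python
def floating_threshold(mean_ham, mean_spam, fraction=0.30):
    """Return the score lying `fraction` of the way from mean_ham to mean_spam."""
    return mean_ham + fraction * (mean_spam - mean_ham)


# Averages quoted in the thread:
#   distance between the means = 6.8714134 - (-2.1532284) = 9.0246418
#   30% of that distance       = 2.70739254
#   threshold                  = -2.1532284 + 2.70739254 = 0.55416414
threshold = floating_threshold(-2.1532284, 6.8714134)
print(round(threshold, 4))
```

Setting `fraction` to 0.5 gives the plain midpoint of the two means, which is the "average of the means" case discussed elsewhere in the thread.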



RE: Bypass URI check

2005-07-11 Thread Chris Santerre
I'm thinking it may be time for SARE to look at this phrase:

"then copy // paste the below page into your window:"

I'll see what I can do with it.

--Chris (I also love the black ;)

  -----Original Message-----
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
  Sent: Monday, July 11, 2005 10:42 AM
  To: users@spamassassin.apache.org
  Subject: Bypass URI check

  Hi All,

  I have received a few messages like the following.  This asks the receiver
  to copy and paste the link into their web browser.  Since the href is
  missing, there is no URI check.  That sucks, because the URIBL is my best
  friend right now (love black).  We are close to marking it and URIBL would
  have definitely got it over.  Any ideas on handling this?

  Microsoft Mail Internet Headers Version 2.0
  Received: from .atco.ca ([xxx.xxx.10.122]) by .atco.com with Microsoft SMTPSVC(5.0.2195.6713);
   Mon, 11 Jul 2005 08:01:29 -0600
  Received: from .atco.ca ([xxx.xxx.10.122])
   by .atco.ca (SMSSMTP 4.0.0.59) with SMTP id M2005071108012819018
   ; Mon, 11 Jul 2005 08:01:28 -0600
  Received: from [58.224.196.19] (helo=xxx.xxx.10.122)
      by .atco.ca with smtp (Exim )
      id 1Dryqd-0001kd-Sa; Mon, 11 Jul 2005 08:01:28 -0600
  X-Orcpt: rfc727;zmailer-log
  Message-ID: <[EMAIL PROTECTED]>
  Date: Mon, 11 Jul 2005 10:58:46 -0400
  From: "Joan Kerry "
  To: "Joan Kerry " <[EMAIL PROTECTED]>
  Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
  Subject: Appetie Suppresant Mon, 11 Jul 2005 07:04:46 -0800
  MIME-Version: 1.0
  Content-Type: multipart/alternative;
      boundary="--4.AMLHjl9J5.pLVdgIYrJD9"
  X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on .atco.ca
  X-Spam-Level:
  X-Spam-Status: No, score=4.3 required=5.0 tests=J_CHICKENPOX_55,MANGLED_STOP,
      RCVD_HELO_IP_MISMATCH,RCVD_NUMERIC_HELO autolearn=disabled
      version=3.0.3
  Return-Path: [EMAIL PROTECTED]
  X-OriginalArrivalTime: 11 Jul 2005 14:01:29.0016 (UTC) FILETIME=[08D53380:01C58621]

  4.AMLHjl9J5.pLVdgIYrJD9
  Content-Type: text/plain;
   format=flowed;
   charset=iso-8859-15
  Content-Transfer-Encoding: 7Bit

  4.AMLHjl9J5.pLVdgIYrJD9
  Content-Type: text/plain;
   format=flowed;
   charset=iso-8859-15
  Content-Transfer-Encoding: 7Bit

  4.AMLHjl9J5.pLVdgIYrJD9--

  [EMAIL PROTECTED]

  Do you find it difficult to cut down on delicious foods filled with carbs?
  such as pasta,cakes,breads,potato chips and ice cream?

  if you are one of the people...then copy // paste the below page into your window:

  slimfat.info

  Kindly,
  Joan Kerry
  -

  2refrain: s-t-o-p.info
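
Because this spam gives a bare domain instead of a link, one option is to match the instruction text itself. The Python sketch below only shows the phrase match; the pattern, its whitespace tolerances, and the function name `asks_to_copy_paste` are hypothetical, and this is not a SpamAssassin rule or anything SARE has published.

```python
import re

# Hypothetical pattern: tolerate variable spacing around the "//" and
# between the words, and ignore case.
COPY_PASTE = re.compile(r"copy\s*//\s*paste\s+the\s+below\s+page", re.IGNORECASE)


def asks_to_copy_paste(body):
    """Return True if the body asks the reader to copy and paste a page."""
    return COPY_PASTE.search(body) is not None


sample = "if you are one of the people...then copy // paste the below page into your window:"
print(asks_to_copy_paste(sample))
```

A phrase match like this is easy for spammers to reword around, so it would probably work best as a small score alongside other rules rather than as a decisive test.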



sa-learn on a wide site HOWTO ?

2005-07-11 Thread Karl.Oulmi

Hi,

I always have a box with postfix/amavis and SpamAssassin running.
Now, I'd like to run sa-learn so that my users (~500) can teach Spam & Ham 
to SpamAssassin.


The idea is the following.
On every mail passed through my mailserver, a header or a footer is 
added to the mail with a mailto link that lets my users teach 
SpamAssassin whether the mail is spam or not.


Has anybody ever implemented this solution? Does anyone have a howto 
or a good URL on this subject?


Many thanks

KARL :)
--



smime.p7s
Description: S/MIME Cryptographic Signature


Re: Rule: envelope to <> header to - help?

2005-07-11 Thread Matt Kettler
Michael W Cocke wrote:
> Does anyone have a rule to check the envelope To: against the header
> to: ? I'm sure that there's a reason why it's allowed to be different,
> but it doesn't apply here, and almost half of the spam that gets thru
> everything else would get stopped by that.

No. It's generally not possible, as SA does not have access to the envelope.

Also, bear in mind that there are LOTS of reasons why it would be allowed to be
different, and your location is not likely to be an exception, despite what you
think.

For example, all posts sent to any mailing list, including this mailing list,
will mismatch.

Unless your site does ALL of the following, you'll have mismatches:

-No users subscribe to ANY mailing lists, including listservs as well as
commercial newsletters, etc.
-No users receive mail from anyone that uses bcc.
-No users may have mail redirected from another account (ie: auto-forward from
yahoo)

(need I go on?) 

In general the only systems that won't get mismatches between the envelope and
the to: are systems that don't receive any internet mail except single-user to
single-user messages. And that's got to be strict, NO other internet email but
single-user to single-user.
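
To make the mailing-list case concrete, here is a small Python sketch of the comparison being asked for. It assumes the envelope recipient is supplied from somewhere outside the message (for example by the MTA), which is exactly the piece SA does not normally have; the function name and all addresses are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses


def envelope_matches_header(envelope_rcpt, raw_message):
    """Return True if envelope_rcpt appears in the message's To: header."""
    msg = message_from_string(raw_message)
    to_headers = msg.get_all("To", [])
    listed = {addr.lower() for _, addr in getaddresses(to_headers)}
    return envelope_rcpt.lower() in listed


# A mailing-list post: delivered to the subscriber, addressed to the list.
list_post = (
    "From: poster@example.org\n"
    "To: users@lists.example.org\n"
    "Subject: Re: Bayes Questions\n"
    "\n"
    "body\n"
)
print(envelope_matches_header("subscriber@example.com", list_post))
```

The list post fails the check even though it is perfectly legitimate mail, which is the mismatch described above; bcc'd and auto-forwarded mail would fail the same way.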


Bypass URI check

2005-07-11 Thread Martin.Carnegie

Hi All,


I have received a few messages like the following.  This asks the receiver to
copy and paste the link into their web browser.  Since the href is missing,
there is no URI check.  That sucks, because the URIBL is my best friend right
now (love black).  We are close to marking it and URIBL would have definitely
got it over.  Any ideas on handling this?

Microsoft Mail Internet Headers Version 2.0
Received: from .atco.ca ([xxx.xxx.10.122]) by .atco.com with Microsoft SMTPSVC(5.0.2195.6713);
 Mon, 11 Jul 2005 08:01:29 -0600
Received: from .atco.ca ([xxx.xxx.10.122])
 by .atco.ca (SMSSMTP 4.0.0.59) with SMTP id M2005071108012819018
 ; Mon, 11 Jul 2005 08:01:28 -0600
Received: from [58.224.196.19] (helo=xxx.xxx.10.122)
    by .atco.ca with smtp (Exim )
    id 1Dryqd-0001kd-Sa; Mon, 11 Jul 2005 08:01:28 -0600
X-Orcpt: rfc727;zmailer-log
Message-ID: <[EMAIL PROTECTED]>
Date: Mon, 11 Jul 2005 10:58:46 -0400
From: "Joan Kerry " 
To: "Joan Kerry " <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Appetie Suppresant Mon, 11 Jul 2005 07:04:46 -0800
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="--4.AMLHjl9J5.pLVdgIYrJD9"
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on .atco.ca
X-Spam-Level: 
X-Spam-Status: No, score=4.3 required=5.0 tests=J_CHICKENPOX_55,MANGLED_STOP,
    RCVD_HELO_IP_MISMATCH,RCVD_NUMERIC_HELO autolearn=disabled 
    version=3.0.3
Return-Path: [EMAIL PROTECTED]
X-OriginalArrivalTime: 11 Jul 2005 14:01:29.0016 (UTC) FILETIME=[08D53380:01C58621]

4.AMLHjl9J5.pLVdgIYrJD9
Content-Type: text/plain;
 format=flowed;
 charset=iso-8859-15
Content-Transfer-Encoding: 7Bit

4.AMLHjl9J5.pLVdgIYrJD9
Content-Type: text/plain;
 format=flowed;
 charset=iso-8859-15
Content-Transfer-Encoding: 7Bit

4.AMLHjl9J5.pLVdgIYrJD9--

[EMAIL PROTECTED]

Do you find it difficult to cut down on delicious foods filled with carbs? 
such as pasta,cakes,breads,potato chips and ice cream?

if you are one of the people...then copy // paste the below page into your window:

slimfat.info

Kindly,
Joan Kerry 
-

2refrain: s-t-o-p.info





Re: SURBL & SA 3.0.4

2005-07-11 Thread Matt Kettler
Dr Robert Young wrote:
> Is there a particular "port" and/or "protocol (TCP/UDP) that must be
> opened on any firewalls that might be on the network for the plugin to
> work?

You don't "need" to open any ports, however you must be able to resolve DNS
queries.

In general you can test it by using "host www.spamassassin.org". If you get an
answer back, DNS works. If not, DNS doesn't.


In general your nameserver must be able to perform queries to port 53 as a UDP
client. If your firewall is stateful, you only need to open it in the outbound
direction (if you've locked down outbound traffic at all). If it's a stateless
packet filter, then you'll need to open both.


You can set what nameservers your SA box will use in /etc/resolv.conf.
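
If you would rather script the same check than run "host" by hand, a rough Python equivalent looks like this. It is only a sketch: it asks the system resolver (the nameservers from /etc/resolv.conf) for an address and reports success or failure, and the function name `dns_works` and its `resolve` parameter are hypothetical.

```python
import socket


def dns_works(hostname="www.spamassassin.org", resolve=socket.gethostbyname):
    """Return True if hostname resolves, False if the lookup fails.

    `resolve` defaults to the system resolver; it is a parameter only so
    the check can be exercised without network access.
    """
    try:
        resolve(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

Calling `dns_works()` on the SA box does the real lookup. A False result points at the resolver configuration or at the firewall rules for port 53 rather than at the plugin itself.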


Re: (repost) bayes_ignore_from with wildcard ?

2005-07-11 Thread Matt Kettler

At 04:43 AM 7/11/2005, [EMAIL PROTECTED] wrote:

> Hello,
>
> Does anyone know if this will work:
>
> bayes_ignore_from  [EMAIL PROTECTED]
>
> The docs don't say specifically if this kind of directive is allowed.
> They do say that this kind of thing will work for whitelist_from.


We all got your message the first time. No, I don't know. But from a casual 
glance at the code, it should work.


conf.pm builds a list named bayes_ignore_from.

bayes.pm calls:

$ignore = $PMS->check_from_in_list('bayes_ignore_from')
|| $PMS->check_to_in_list('bayes_ignore_to');


check_from_in_list is actually implemented in EvalTests.pm:


sub check_from_in_list {
  my ($self,$list) = @_;
  my $list_ref = $self->{conf}{$list};
  warn "Could not find list $list" unless defined $list_ref;

  foreach my $addr (all_from_addrs $self) {
    return 1 if _check_whitelist $self $list_ref, $addr;
  }

  return 0;
}


_check_whitelist is the same comparison function the black and whitelists 
use, so it should work the same.


Although by looking at _check_whitelist, I wonder if it works the way the 
docs say. The docs claim it's file glob and not regex, but _check_whitelist 
looks a lot like it does a regex.
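
One way both statements can be true is if the glob patterns from the config are translated into anchored regular expressions before the comparison runs. Whether SpamAssassin does exactly this is an assumption here, not something confirmed from the source quoted above. The Python sketch below shows the general technique; the function names and addresses are hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Translate a file-glob address pattern into an anchored regex.

    '*' matches any run of characters, '?' matches exactly one character,
    and everything else is matched literally. The pattern is lowercased,
    so callers should lowercase the address too.
    """
    parts = []
    for ch in pattern.lower():
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$")


def address_in_list(address, patterns):
    """Return True if address matches any glob pattern in patterns."""
    address = address.lower()
    return any(glob_to_regex(p).match(address) for p in patterns)


print(address_in_list("alice@example.com", ["*@example.com"]))
```

With a translation like this, the user writes globs in the config file while the matching code only ever sees regexes, which would explain why the comparison function looks like a regex match.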








Re: simultaneous sa-learn processes

2005-07-11 Thread Chavdar Videff
On Monday 11 July 2005 15:31, Kai Schaetzl wrote:
> Chavdar Videff wrote on Mon, 11 Jul 2005 13:40:14 +0300:
> > If I got it right, we should run sa-learn for each user in order to
> > benefit from bayes. We intend to run a cron job for each user and do it
> > at night by supplying a daily snapshot of our spam and ham collections to
> > sa-learn.
>
> Do I understand you correctly? You use Bayes for each user, but you want to
> sa-learn each of them the same daily corpus? This means the only difference
> in the users' Bayes db's will be auto-learned mail or mail learned by those
> users (if anything of that is possible/allowed with your setup). Doesn't
> look too useful to me. If most of the db content is the same then you could
> just use a site-wide db. Also, Bayes gets better the more mail it gets. If
> your users don't get much mail their individual Bayes db's won't be very
> effective. I'm all for using site-wide Bayes unless your users really get a
> lot of mail (I'd say at least 100 mails per user per day).
>
> Kai
I thought it was installed site-wide, however the only bayes db's I find on 
the system are in each user's ~/.spamassassin folder. And indeed, the only 
way I can make bayes learn is by teaching it on a per-user basis. For quite a 
few months I collected spam, fed it to sa-learn and finally, reading this 
list, realized that all I did was teach root's database. Everybody else did 
not benefit from bayes, which was screwed up because it autolearned a lot of 
spam as ham. 
If there is a way to set up a single bayes database I would prefer that, for 
the scenario I am posting about does not make me happy (running 100 sa-learns 
at night).
Thanks
Chavdar



Re: simultaneous sa-learn processes

2005-07-11 Thread Kai Schaetzl
Chavdar Videff wrote on Mon, 11 Jul 2005 13:40:14 +0300:

> If I got it right, we should run sa-learn for each user in order to benefit 
> from bayes. We intend to run a cron job for each user and do it at night by 
> supplying a daily snapshot of our spam and ham collections to sa-learn.

Do I understand you correctly? You use Bayes for each user, but you want to 
sa-learn each of them the same daily corpus? This means the only difference in 
the users' Bayes db's will be auto-learned mail or mail learned by those users 
(if anything of that is possible/allowed with your setup). Doesn't look too 
useful to me. If most of the db content is the same then you could just use a 
site-wide db. Also, Bayes gets better the more mail it gets. If your users 
don't get much mail their individual Bayes db's won't be very effective. I'm 
all for using site-wide Bayes unless your users really get a lot of mail (I'd 
say at least 100 mails per user per day).

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org





Re: simultaneous sa-learn processes

2005-07-11 Thread Chavdar Videff
On Monday 11 July 2005 14:50, JamesDR wrote:
> Chavdar Videff wrote:
> > Hi List,
> >
> > Our mailserver server serves about 100 users. Our config:
> > Sendmail+Procmail+SpamAssassin.
> > The question is:
> > If I got it right, we should run sa-learn for each user in order to
> > benefit from bayes. We intend to run a cron job for each user and do it
> > at night by supplying a daily snapshot of our spam and ham collections to
> > sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)?
> > A weekly collection run for 1 user usually eats 100% of CPU load. My
> > concern is whether the system is going to crash or just do the job slower
> > and if you can point out how many sa-learn tasks could we run
> > simultaneously with our setup.
> > All hints will be appreciated, for we scheduled an initial load for 16
> > users of the big collection of spam received so far.
> >
> > Thanks guys
> >
> > Chavdar Videff
>
> What kind of Bayes db are you using? We use MySQL here and haven't seen
> SA-Learn use up that much cpu... I've run it manually up to 10 processes
> at once without any noticeable slowing of the machine. (p2 450mhz, 256mb)

I guess it is BerkeleyDB, the default installation on Debian. The interesting 
part is that while testing cron on one user the cpu fall was not noticeable. 

Chavdar Videff


RE: simultaneous sa-learn processes

2005-07-11 Thread Sander Holthaus - Orange XL
JamesDR wrote:
> Chavdar Videff wrote:
>> Hi List,
>> 
>> Our mailserver server serves about 100 users. Our config:
>> Sendmail+Procmail+SpamAssassin.
>> The question is:
>> If I got it right, we should run sa-learn for each user in order to
>> benefit from bayes. We intend to run a cron job for each user and do
>> it at night by supplying a daily snapshot of our spam and ham
>> collections to sa-learn. Can our mailserver handle it (256 MB RAM,
>> Celeron 400 Mhz)?

Why would you want to set up Bayes on a per-user basis if you are going to
feed it system-wide hams and spams? Especially feeding it system-wide hams
is odd.
 
>> A weekly collection run for 1 user usually eats 100% of CPU load. My
>> concern is whether the system is going to crash or just do the job
>> slower and if you can point out how many sa-learn tasks could we run
>> simultaneously with our setup.

Systems shouldn't crash under high load, so that's not a real concern. If it
does happen, you have a more serious problem elsewhere. What would be more
of a concern is how it is going to affect other processes running on your
system. Slower is not a problem, but if you really put the load on your box
from a lot of processes, you might start seeing time-outs.

>> All hints will be appreciated, for we scheduled an initial load for
>> 16 users of the big collection of spam received so far.

If you are going to simultaneously learn spam and ham for 16 users, and
want to keep running your mailserver/SpamAssassin too (I take it you also have
a virus scanner running somewhere), I would consider at least running the
sa-learn processes under nice to keep them from stalling more essential
services. But, depending on your system setup (OS, DB, etc.) you might want
to cut down a little on the number of processes run simultaneously.
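
The advice above (run sa-learn under nice and cap how many run at once) could be sketched roughly as follows. This is only an illustration under assumptions: the corpus paths are hypothetical, the helper names build_learn_command and run_jobs are made up for this example, and max_parallel=1 is just a cautious starting value for a small box, not a measured recommendation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_learn_command(kind, corpus_path, niceness=19):
    """Return the argv list for one low-priority sa-learn run.

    kind is "spam" or "ham"; corpus_path is a hypothetical mail directory.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    # "nice -n <level>" lowers the scheduling priority of the child process
    # so that mail delivery and scanning keep getting CPU time first.
    return ["nice", "-n", str(niceness), "sa-learn", "--" + kind, corpus_path]


def run_jobs(commands, max_parallel=1, runner=subprocess.call):
    """Run the commands with at most max_parallel of them active at once.

    Returns the exit codes in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Hypothetical corpus locations; adjust to wherever spam/ham is collected.
    jobs = [
        build_learn_command("spam", "/var/spool/corpus/spam"),
        build_learn_command("ham", "/var/spool/corpus/ham"),
    ]
    print(run_jobs(jobs, max_parallel=1))
```

Raising max_parallel trades a shorter total run for more contention with the mail services, so it is probably worth increasing it one step at a time while watching for time-outs.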

>> 
>> Thanks guys
>> 
>> Chavdar Videff
>> 
>> 
> What kind of Bayes db are you using? We use MySQL here and
> haven't seen SA-Learn use up that much cpu... I've run it
> manually up to 10 processes at once without any noticeable
> slowing of the machine. (p2 450mhz, 256mb)




Re: simultaneous sa-learn processes

2005-07-11 Thread JamesDR

Chavdar Videff wrote:

> Hi List,
>
> Our mailserver server serves about 100 users. Our config:
> Sendmail+Procmail+SpamAssassin.
>
> The question is:
> If I got it right, we should run sa-learn for each user in order to benefit
> from bayes. We intend to run a cron job for each user and do it at night by
> supplying a daily snapshot of our spam and ham collections to sa-learn.
>
> Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)?
> A weekly collection run for 1 user usually eats 100% of CPU load. My concern
> is whether the system is going to crash or just do the job slower and if you
> can point out how many sa-learn tasks could we run simultaneously with our
> setup.
> All hints will be appreciated, for we scheduled an initial load for 16 users
> of the big collection of spam received so far.
>
> Thanks guys
>
> Chavdar Videff


What kind of Bayes db are you using? We use MySQL here and haven't seen 
SA-Learn use up that much cpu... I've run it manually up to 10 processes 
at once without any noticeable slowing of the machine. (p2 450mhz, 256mb)


--
Thanks,
James



simultaneous sa-learn processes

2005-07-11 Thread Chavdar Videff
Hi List,

Our mailserver serves about 100 users. Our config: 
Sendmail+Procmail+SpamAssassin.
The question is:
If I got it right, we should run sa-learn for each user in order to benefit 
from bayes. We intend to run a cron job for each user and do it at night by 
supplying a daily snapshot of our spam and ham collections to sa-learn.
Can our mailserver handle it (256 MB RAM, Celeron 400 MHz)?
A weekly collection run for 1 user usually eats 100% of CPU load. My concern 
is whether the system is going to crash or just do the job slower and if you 
can point out how many sa-learn tasks could we run simultaneously with our 
setup.
All hints will be appreciated, for we scheduled an initial load for 16 users 
of the big collection of spam received so far.

Thanks guys

Chavdar Videff


Re: How can I filter this kind of spam?

2005-07-11 Thread Michael Moyse

Kai Schaetzl wrote:


Michael Moyse wrote on Fri, 08 Jul 2005 17:55:32 +0100:

 

To me it looks like a duck and sounds like a duck  I'm probably wrong 
and missing something here because I'm no expert so I'm happy to be 
enlightened.
   



Ok, I enlighten you ;-) I hope I'm not wrong. Now that I look again at the 
headers it turns out I was wrong as well, see below.


From the headers:

Received: (qmail 10812 invoked by uid 567); 5 Jul 2005 12:03:20 - 
Received: from 65.33.195.76 by host1 (envelope-from 
<[EMAIL PROTECTED]>, uid 502) with 
qmail-scanner-1.25 
(clamdscan: 0.86.1/967. spamassassin: 3.0.4.   
Clear:RC:0(65.33.195.76):SA:0(0.0/1.5):. 
Processed in 0.44071 secs); 05 Jul 2005 12:03:20 - 
Received: from unknown (HELO ss) (65.33.195.76) 
by 0 with SMTP; 5 Jul 2005 12:03:19 - 

 


65.33.195.76 = 76.195.33.65.cfl.res.rr.com !
 



Received: from vitalmex.com.mx (mail1.vitalmex.com.mx [148.223.241.181]) 
by 76.195.33.65.cfl.res.rr.com (Pastfix) with ESMTP id 0456EDBA28 
for <[EMAIL PROTECTED]>; Tue, 05 Jul 2005 05:21:23 -0700 


The mail went:
vitalmex -> Roadrunner (Po/astfix) -> boom-edv.de (qmail)
The last Received line looks forged (Pastfix), there's also no SMTP 
running at 76.195.33.65.cfl.res.rr.com (=no open/abusable relay). This 
suggests that the mail was sent out directly from that roadrunner account 
and the last Received plus all vitalmex stuff is completely forged. Also, 
a spammer who abused a Roadrunner account would obviously not send 
openly from his own MX and give you a return-path which leads back to 
him.


So, what you actually have to block is .rr.com and not .vitalmex.com.mx or 
.mx. This mail would have never reached us, because we already block all 
of .rr.com :-)



Kai

 


Cool! Thanks for the explanation
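
The observation in Kai's explanation that 65.33.195.76 corresponds to 76.195.33.65.cfl.res.rr.com can be illustrated with a small string check. This is a sketch, not part of the original thread: the function names are made up for the example, and it only compares strings. A real check would first look up the PTR record (for instance with socket.gethostbyaddr) rather than trust a name that appears in a header.

```python
def reversed_octets(ip):
    """Return the dotted IPv4 address with its octets in reverse order."""
    return ".".join(reversed(ip.split(".")))


def looks_like_generic_rdns(ip, hostname):
    """True if hostname starts with the reversed octets of ip.

    Names of this shape are a common sign of a dynamically assigned
    residential address rather than a dedicated mail server.
    """
    return hostname.lower().startswith(reversed_octets(ip) + ".")


if __name__ == "__main__":
    ip = "65.33.195.76"
    host = "76.195.33.65.cfl.res.rr.com"
    print(reversed_octets(ip), looks_like_generic_rdns(ip, host))
```

A match here supports the reading that the message was injected directly from the Roadrunner address, which is why the thread ends up recommending a block on .rr.com rather than on the forged vitalmex names.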


bayes_ignore_from with wildcard ?

2005-07-11 Thread lists

Hello,

Does anyone know if this will work:

bayes_ignore_from  [EMAIL PROTECTED]

The docs don't say specifically if this kind of directive is allowed.
They do say that this kind of thing will work for whitelist_from.

Regards,
Devin