bayes, IMP and virtual users

2009-03-04 Thread Seba Mueld

Hi,

I'm using SA 3.2.5 with Horde/IMP and Postfix 2.5.5 with virtual users. My 
config is similar to this:
http://wiki.apache.org/spamassassin/IntegratedSpamdInPostfix

I want to use Bayes (SQL) auto-learning with virtual users, and this works as long 
as clients send through SMTP with a real email client. When users send 
through webmail (IMP), spamc does not pick up the correct username and the mail 
gets learned for the wrong user. It seems that spamc takes the recipient of the 
email as the sender (Bayes learns spam/ham for the recipient of the mail).

Config files


Postfix master.cf:

# Incoming e-mail - tux.linuxmail.at (MX)
xx.xx.xx.xx:25  inet  n   -   n   -   -   smtpd
  -o content_filter=spamassassin

spamassassin unix - n   n   -   -   pipe
  flags=Rq user=vmail argv=/usr/bin/spamc -u ${us...@${domain} -e
  /usr/sbin/sendmail -oi -f ${sender} ${recipient}

Spamassassin local.cf:

use_bayes 1
bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn   DBI:mysql:spamassassin:localhost
bayes_sql_username  user
bayes_sql_password  pwd
bayes_auto_learn 1

spamd options (Debian-based /etc/default/spamassassin):

OPTIONS=--max-children 5 -d -q -u vmail --nouser-config 
--virtual-config-dir=/home/vmail/%d/%l

When a user then sends an email from webmail to a recipient, I see the 
recipient's email address with spam and ham counts in my Bayes MySQL DB, but the 
mail should get learned for the user who sent the mail.

Seba





blacklist_from

2009-03-04 Thread Geert Batsleer
Hi all,

I'm trying to blacklist email from 'Vegas Club Casino', which is being
sent from different email addresses but always with the same 'From' in the
header.

I tried putting it in local.cf as blacklist_from Vegas Club Casino
but those mails keep coming.

How can I filter on just the From name, without using an email address?

best regards,

Geert

PS: I'm using spamassassin-3.2.5-1.el4.rf with score 3


Re: blacklist_from

2009-03-04 Thread Matus UHLAR - fantomas
On 04.03.09 10:10, Geert Batsleer wrote:
 I'm trying to blacklist email from 'Vegas Club Casino', which is being
 sent from different email addresses but always with the same 'From' in the
 header.
 
 I tried putting it in local.cf as blacklist_from Vegas Club Casino
 but those mails keep coming.

If you look at the docs, you'll see that the blacklist_* options only apply to
addresses.

 How can I filter on just the From name, without using an email address?

a rule will be needed

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
We are but packets in the Internet of life (userfriendly.org)


Re: blacklist_from

2009-03-04 Thread Yet Another Ninja

On 3/4/2009 10:18 AM, Matus UHLAR - fantomas wrote:

On 04.03.09 10:10, Geert Batsleer wrote:

I'm trying to blacklist email from 'Vegas Club Casino', which is being
sent from different email addresses but always with the same 'From' in the
header.

I tried putting it in local.cf as blacklist_from Vegas Club Casino
but those mails keep coming.


If you look at the docs, you'll see that the blacklist_* options only apply to
addresses.


How can I filter on just the From name, without using an email address?


a rule will be needed



header FROM_BLAH  From:name =~ /\bBLAH\b/i

should do the trick
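
To try the pattern offline before putting it into a rule, the same match can be approximated in Python. This is only a sketch: it uses the standard library's `parseaddr` to pull out the display name, which may not split the header exactly the way SpamAssassin's `From:name` does, and `from_name_matches` and the sample addresses are made up for illustration.

```python
import re
from email.utils import parseaddr

# Same idea as /\bBLAH\b/i, with the name from the original question.
NAME_RE = re.compile(r"\bVegas Club Casino\b", re.IGNORECASE)


def from_name_matches(from_header):
    """Return True if the display-name part of a From header matches."""
    display_name, _address = parseaddr(from_header)
    return bool(NAME_RE.search(display_name))


print(from_name_matches('"Vegas Club Casino" <x1@example.com>'))
print(from_name_matches("Geert <vegasclubcasino@example.com>"))
```

The second call shows why the rule looks at the name rather than the address: the address changes from mail to mail, the name does not.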


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread Matus UHLAR - fantomas
 Le 03/03/2009 17:42, Matus UHLAR - fantomas a écrit :
 I have already been thinking about the possibility of combining every two
 rules and doing a masscheck over them. Then, optionally, repeating that
 again, skipping duplicates. Finally, gather all rules that scored >=0.5
 || <=-0.5 - we could have an interesting ruleset here.
 
 But that's going to be a HUGE ruleset.

 On Mar 3, 2009, at 10:06, John Wilcock j...@tradoc.fr wrote:
 Not to mention that different combinations will suit different sites.
 
 I wonder about the feasibility of a second Bayesian database, using 
 the same learning mechanism as the current system, but keeping track 
 of rule combinations instead of keywords.

 LuKreme wrote:
 It sounds like a really good idea to me, and also like the most 
 reasonable way to manage self-learning meta rules.

On 03.03.09 16:43, Marc Perkel wrote:
 It seems to me that the consensus is that it's worth a try. I don't know 
 if it will work or not but I think there's a good chance this could be a 
 significant advancement in how well SA works.

I should note that some policy rules and rules with manually updated scores
(SPF_PASS, BAYES_*) may need to be exempted from this.
We don't want SPF_PASS to generate high positive score, do we?
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
M$ Win's are shit, do not use it !


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread Justin Mason
On Wed, Mar 4, 2009 at 00:43, Marc Perkel m...@perkel.com wrote:
 LuKreme wrote:

 On Mar 3, 2009, at 10:06, John Wilcock j...@tradoc.fr wrote:

 Le 03/03/2009 17:42, Matus UHLAR - fantomas a écrit :

 I have already been thinking about the possibility of combining every two
 rules and doing a masscheck over them. Then, optionally, repeating that again,
 skipping duplicates. Finally, gather all rules that scored >=0.5 || <=-0.5
 - we could have an interesting ruleset here.

 But that's going to be a HUGE ruleset.

 Not to mention that different combinations will suit different sites.

 I wonder about the feasibility of a second Bayesian database, using the
 same learning mechanism as the current system, but keeping track of rule
 combinations instead of keywords.

 It sounds like a really good idea to me, and also like the most reasonable
 way to manage self-learning meta rules.

 It seems to me that the consensus is that it's worth a try. I don't know if
 it will work or not but I think there's a good chance this could be a
 significant advancement in how well SA works.

So you're volunteering to code it up, then? ;)

--j.


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread John Wilcock

Le 04/03/2009 10:38, Matus UHLAR - fantomas a écrit :

I should note that some policy rules and rules with manually updated scores
(SPF_PASS, BAYES_*) may need to be exempted from this.
We don't want SPF_PASS to generate high positive score, do we?


It could probably be argued both ways. There might be advantages in 
letting the postulated system give a positive boost to high-confidence 
spam indicators even if (or perhaps particularly when) they occur in 
combination with rules that are low-confidence ham indicators like SPF_PASS.


But I guess these sorts of details would need to be investigated by 
whoever takes on the task of designing and coding the system. It would 
no doubt take some fairly complex statistical analysis of different 
possible strategies to implement this idea. I for one have neither the 
time nor the expertise, unfortunately, to do much more than express an idea!


John.

--
-- Over 3000 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages- www.tradoc.fr


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread decoder

Justin Mason wrote:


So you're volunteering to code it up, then? ;)


I was planning to do at least some brainstorming and experiments as to what 
learning methods would seem suitable and how well the method performs, 
whenever I have time again. Unless someone else did that already?






Re: Bye Bye Bayes

2009-03-04 Thread Kai Schaetzl
LuKreme wrote on Tue, 3 Mar 2009 19:02:06 -0700:

 How is it the same? Already read messages in inbox means the user has  
 accepted those messages without trashing them or junking them.

and the message may not have been learned by score.
If you can make sure that your users *really* delete or move spam to the 
right places, then it works, yes. But I fear there is a chance that users 
just walk over spam and let it stay as (depending on the mail client) it 
may just not be visible anymore which may be good enough for them.
So, there's a chance of undesired infection with spam.

 False junk would get pulled out of .Junk into the inbox and relearned  
 as ham.

How? By the user? When? What about vacation?
I wouldn't trust too much that users do the right thing. Depends on your 
user base.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com





Re: Bye Bye Bayes

2009-03-04 Thread John Hardin

On Tue, 3 Mar 2009, LuKreme wrote:


On Mar 3, 2009, at 17:07, John Hardin jhar...@impsec.org wrote:


On Tue, 3 Mar 2009, LuKreme wrote:

 I am considering the following:
 
 Autolearn read mail in the inbox as ham

 Autolearn mail in .Junk and .SPAM as spam
 
 This is pretty easy with maildir.


How is that different from using the built-in autolearning based on 
message score?


How is it the same? Already read messages in inbox means the user has 
accepted those messages without trashing them or junking them.


Sorry, I didn't register that part. I thought it was just messages in the 
inbox.


Bear in mind some mail clients will mark a message read if you only 
highlight the title line. Auto-preview can be annoying that way sometimes.


.Junk means the user, or the user's MUA, has flagged a message that is 
not tagged as spam.


Okay, I was assuming that was your SA spam quarantine, not your equivalent 
of the user's spam training folder.


False junk would get pulled out of .Junk into the inbox and relearned as 
ham.


Haven't done it, still mulling.


Now that you've explained it in more detail it sounds better.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Failure to plan ahead on someone else's part does not constitute
  an emergency on my part. -- David W. Barts in a.s.r
---
 4 days until Daylight Saving Time begins in U.S. - Spring Forward


Re: Bye Bye Bayes

2009-03-04 Thread John Hardin

On Wed, 4 Mar 2009, Kai Schaetzl wrote:


LuKreme wrote on Tue, 3 Mar 2009 19:02:06 -0700:


How is it the same? Already read messages in inbox means the user has
accepted those messages without trashing them or junking them.


If you can make sure that your users *really* delete or move spam to the 
right places, then it works, yes.


That, of course, is the crux of the biscuit.

I used to have a couple of users who treated their Trash folder as 
long-term read-message storage. After reading most messages they'd move 
them to Trash, and _never_ _purge_ _it_. I couldn't break them of this 
habit, even after purging their Trash folder from the server a couple of 
times. (Oops! Disk failure! Well, that was trash, you can afford to lose 
that.)


But I fear there is a chance that users just walk over spam and let it 
stay as (depending on the mail client) it may just not be visible 
anymore which may be good enough for them.


Or delete it rather than moving it to .Junk

I'll modify my earlier comment - it sounds good, assuming you have a high 
degree of users behaving the way you want them to.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Failure to plan ahead on someone else's part does not constitute
  an emergency on my part. -- David W. Barts in a.s.r
---
 4 days until Daylight Saving Time begins in U.S. - Spring Forward


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread Marc Perkel



Matus UHLAR - fantomas wrote:

I should note that some policy rules and rules with manually updated scores
(SPF_PASS, BAYES_*) may need to be exempted from this.
We don't want SPF_PASS to generate high positive score, do we?
  


The idea of all this is that we might discover things like SPF_PASS 
combined with other rules might be useful where by itself it's not. We 
might find ourselves generating more informational tokens that by 
themselves don't score but are useful in combination with other rules.


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread Marc Perkel



Justin Mason wrote:

On Wed, Mar 4, 2009 at 00:43, Marc Perkel m...@perkel.com wrote:
  

LuKreme wrote:


On Mar 3, 2009, at 10:06, John Wilcock j...@tradoc.fr wrote:

  

Le 03/03/2009 17:42, Matus UHLAR - fantomas a écrit :


I have already been thinking about the possibility of combining every two
rules and doing a masscheck over them. Then, optionally, repeating that again,
skipping duplicates. Finally, gather all rules that scored >=0.5 || <=-0.5
- we could have an interesting ruleset here.

But that's going to be a HUGE ruleset.
  

Not to mention that different combinations will suit different sites.

I wonder about the feasibility of a second Bayesian database, using the
same learning mechanism as the current system, but keeping track of rule
combinations instead of keywords.


It sounds like a really good idea to me, and also like the most reasonable
way to manage self-learning meta rules.

  

It seems to me that the consensus is that it's worth a try. I don't know if
it will work or not but I think there's a good chance this could be a
significant advancement in how well SA works.



So you're volunteering to code it up, then? ;)

--j.

  

I would if I were any good at perl.


Dealing with low scoring spam - tighter MTA integration [was: 2 + 2 != 4 - Spamassassin needs a new paradigm]

2009-03-04 Thread Andrzej Adam Filip
Karsten Bräckelmann guent...@rudersport.de wrote:

 On Tue, 2009-03-03 at 08:32 -0800, Marc Perkel wrote:
 Spamassassin works by adding up points. Rule A is 2 points, Rule B is 2 
 points therefore the score is 4 points. But is this the best way to 
 score? I don't think so.
 [...]
 Anyhow - just throwing this out there for people to chew on and think about.

 Oh, and another problem with this:

 About 98-99% of my spam in-stream scores so high that any such proposal
 results in a useless increase of the score.

 The problem lies with the LOW scoring spam. Alas, these do not tend to
 trigger on a solid subset or meta as you proposed. In particular, RBL
 hits are quite rare, even more so for multiple hits. The few rules hit
 by low scorers are quite diverse, which complicates this.

Maybe SpamAssassin should create a set of tests intended for use before
replying to RCPT TO: in the SMTP session?
[ tests based on: sending IP address, envelope sender, envelope
recipient, and name in HELO/EHLO ]

Possible recommended actions: accept, temporary reject, permanent
reject - with the choice based on spam score *AND* mail source reputation.

A temporary reject in the SMTP session should increase the chances of DNSBL hits
by reducing the blind-spot period of newly created spam sources.
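
The accept / temporary reject / permanent reject choice could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name, the three reputation labels and both thresholds are invented, and nothing like this exists in SpamAssassin.

```python
# Hypothetical SMTP-time decision: combine a partial spam score with a
# coarse reputation label for the sending IP. Thresholds are placeholders.
REPUTATIONS = ("good", "unknown", "bad")


def smtp_time_action(score, reputation, defer_at=5.0, reject_at=10.0):
    """Return 'accept', 'tempfail' or 'reject' for one RCPT TO."""
    if reputation not in REPUTATIONS:
        raise ValueError("unknown reputation %r" % reputation)
    if reputation == "good":
        # Well-reputed sources are never delayed at this stage.
        return "accept"
    if score >= reject_at and reputation == "bad":
        return "reject"
    if score >= defer_at:
        # Deferring gives the DNSBLs time to catch up with new sources.
        return "tempfail"
    return "accept"


print(smtp_time_action(12.0, "unknown"))
```

A high score from a source with no reputation yet is only deferred, not rejected, which is the point of the temporary reject: the retry happens after the blacklists have had a chance to list the source.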

-- 
[plen: Andrew] Andrzej Adam Filip : a...@onet.eu
The difference between science and the fuzzy subjects is that science
requires reasoning while those other subjects merely require scholarship.
  -- Robert Heinlein


Re: Bye Bye Bayes

2009-03-04 Thread Kai Schaetzl
John Hardin wrote on Wed, 4 Mar 2009 06:17:16 -0800 (PST):

 (Oops! Disk failure! Well, that was trash, you can afford to lose 
 that.)

thanks for the laugh :-)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com





Re: Dealing with low scoring spam - tighter MTA integration [was: 2 + 2 != 4 - Spamassassin needs a new paradigm]

2009-03-04 Thread Karsten Bräckelmann
On Wed, 2009-03-04 at 16:02 +0100, Andrzej Adam Filip wrote:
 Karsten Bräckelmann guent...@rudersport.de wrote:

  About 98-99% of my spam in-stream scores so high that any such proposal
  results in a useless increase of the score.
 
  The problem lies with the LOW scoring spam. Alas, these do not tend to
  trigger on a solid subset or meta as you proposed. In particular, RBL
  hits are quite rare, even more so for multiple hits. The few rules hit
  by low scorers are quite diverse, which complicates this.
 
 Maybe SpamAssassin should create a set of tests intended for use before
 replying to RCPT TO: in the SMTP session?
 [ tests based on: sending IP address, envelope sender, envelope
 recipient, and name in HELO/EHLO ]

This would be an entirely different application, not SA, wouldn't it?

Well, this probably could be done in SA using a multi-level protocol
capable of returning values at different stages. However, this seems
perfectly suited for a lightweight tool, rather than a hog that is
designed to scan and process entire messages. :)


 Possible recommended actions:  accept, temporary reject, permanent
 reject - with choice based on spam score *AND* mail source reputation.
 
 Temporary reject in SMTP session should increase chances of DNSBL hits
 by reducing blind spot period of newly created spam sources.

Experience with grey-listing, tempfail or whatever varies wildly, judging by
the posts to this list. Some do report that the zombies won't retry
anyway after being tempfailed once. So a later DNSBL hit, after the list has
caught up and DNS has propagated, may even be irrelevant.


-- 
char *t=\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4;
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1:
(c=*++x); c128  (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Bye Bye Bayes

2009-03-04 Thread Dave Pooser
 I used to have a couple of users who treated their Trash folder as
 long-term read-message storage.

I have a user like that at $DAYJOB. I used to ask him if he kept his car
title and other important documents in the wastebasket under his desk at
home.
-- 
Dave Pooser
Cat-Herder-in-Chief, Pooserville.com
Sarcasm Error:
Abort, Retry, Bite Me?
-Legostar Galactica




Re: Dealing with low scoring spam - tighter MTA integration

2009-03-04 Thread Andrzej Adam Filip
Karsten Bräckelmann guent...@rudersport.de wrote:

 On Wed, 2009-03-04 at 16:02 +0100, Andrzej Adam Filip wrote:
 Karsten Bräckelmann guent...@rudersport.de wrote:

  About 98-99% of my spam in-stream scores so high that any such proposal
  results in a useless increase of the score.
 
  The problem lies with the LOW scoring spam. Alas, these do not tend to
  trigger on a solid subset or meta as you proposed. In particular, RBL
  hits are quite rare, even more so for multiple hits. The few rules hit
  by low scorers are quite diverse, which complicates this.
 
 Maybe SpamAssassin should create a set of tests intended for use before
 replying to RCPT TO: in the SMTP session?
 [ tests based on: sending IP address, envelope sender, envelope
 recipient, and name in HELO/EHLO ]

 This would be an entirely different application, not SA, wouldn't it?

It can be developed using the same spam score logic, based on the subset of
all tests that requires only the subset of the final data available during a
classic run.

I do think that promoting tools that encourage postmasters to care very
much about mail server (IP address) reputation can make a real difference,
e.g. caring to be above reputation none in DNSWL to avoid grey-listing.

 Well, this probably could be done in SA using a multi-level protocol
 capable of returning values at different stages. However, this seems
 perfectly suited for a lightweight tool, rather than a hog that is
 designed to scan and process entire messages. :)

During initial tests/deployment a *much* simpler implementation can be
used, with the recommended action based on spam score.

It would require a redesign of the 50_scores.cf structure,
  e.g. instead of
    score RCVD_IN_DNSWL_HI 0 -8 0 -8
  something like this
    # N - Network, B - Bayes, nX - no X, R - RCPT TO:
    score RCVD_IN_DNSWL_HI nNnB=0 NnB=-8 nNB=0 NB=-8 R=-8
  or shorter
    score RCVD_IN_DNSWL_HI N=-8 R=-8
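
To make the proposed notation concrete, here is a small sketch of how such lines might be parsed and mapped back onto the four classic score columns. The keyed syntax exists only as a proposal in this thread and SpamAssassin does not accept it; `parse_keyed_score`, `classic_columns` and the reading of `N=` as "whenever network tests are enabled" are assumptions for illustration.

```python
# Order of the four classic score columns:
# no-net/no-bayes, net/no-bayes, no-net/bayes, net/bayes.
SCORE_SETS = ("nNnB", "NnB", "nNB", "NB")


def parse_keyed_score(line):
    """Parse 'score NAME key=value ...' into (name, dict of scores)."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "score":
        raise ValueError("not a keyed score line: %r" % line)
    name = fields[1]
    scores = {key: 0.0 for key in SCORE_SETS}
    scores["R"] = 0.0  # proposed extra column for RCPT TO time
    for item in fields[2:]:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        number = float(value)
        if key == "N":
            # Assumed shorthand: applies with or without Bayes.
            scores["NnB"] = number
            scores["NB"] = number
        elif key in scores:
            scores[key] = number
        else:
            raise ValueError("unknown score key %r" % key)
    return name, scores


def classic_columns(scores):
    """Return the four values in today's 'score NAME a b c d' order."""
    return [scores[key] for key in SCORE_SETS]


name, scores = parse_keyed_score("score RCVD_IN_DNSWL_HI N=-8 R=-8")
print(name, classic_columns(scores), scores["R"])
```

Under these assumptions the short form expands to the same four columns as the existing `0 -8 0 -8` line, so a parser like this could keep old score files working while adding the extra column.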

 Possible recommended actions:  accept, temporary reject, permanent
 reject - with choice based on spam score *AND* mail source reputation.
 
 Temporary reject in SMTP session should increase chances of DNSBL hits
 by reducing blind spot period of newly created spam sources.

 Experience with grey-listing, tempfail or whatever varies wildly given
 the posts to this list. Some do report, that the zombies won't retry
 anyway after being tempfailed once. So a later DNSBL hit after the list
 catching up and DNS propagation may be even irrelevant.

There are DUL zombies that effectively do frequent IP address hopping
and static NAT zombies. The former are bigger in number, the latter
produce higher spam volume (IMHO).

-- 
[plen: Andrew] Andrzej Adam Filip : a...@onet.eu
All the taxes paid over a lifetime by the average American are spent by
the government in less than a second.
  -- Jim Fiebig


Re: Bye Bye Bayes

2009-03-04 Thread Martin Gregorie
On Wed, 2009-03-04 at 16:31 +0100, Kai Schaetzl wrote:
 John Hardin wrote on Wed, 4 Mar 2009 06:17:16 -0800 (PST):
 
  (Oops! Disk failure! Well, that was trash, you can afford to lose 
  that.)
 
 thanks for the laugh :-)

How many of you have seen the BOFH (Bastard Operator From Hell) stories?
http://www.theregister.co.uk/odds/bofh/

They may amuse some of you...


Martin




Re: Bye Bye Bayes

2009-03-04 Thread Kai Schaetzl
Karsten Bräckelmann wrote on Wed, 04 Mar 2009 02:25:51 +0100:

 That's bayes_auto_learn_threshold_spam and nonspam respectively, I
 guess? Keep in mind that threshold is not the actual score, so you
 aren't learning all spam with a score of 8+ then.

Right, I know. That's where the spam quarantine comes into play. All spam 
in it (= everything with score 5 or higher) gets learned in the night.
That's absolutely necessary as we don't get much spam. 96% of the mail 
that is accepted is ham (or spam that comes in because the user opted out, 
there's no distinction because there's no detection). The remainder is 
either a virus or other bad content or High Scoring spam. Low scoring spam 
is almost non-existent.

 
 Kai, given a nonspam threshold of -2, how exactly do you (manually)
 learn ham? That would be interesting. And what's the ham/spam ratio?

I just checked and have to admit we must have removed the
bayes_auto_learn_threshold_ham -2
some time ago as 0.01 seems to be reliable enough. Only the 
bayes_auto_learn_threshold_spam 8
is in effect now.
But I believe -2 would also deliver enough ham for autolearning. Here is the score 
distribution of the last 40,000 or so messages on the same server:

Score    Messages
  -15           6
   -4       3,364
   -3       4,249
   -2       9,982
   -1       4,760
    0      13,995
    1       1,267
    2         789
    3         387
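
As a rough check of the "-2 would deliver enough ham" estimate, the listed counts can be summed. This is a sketch under two assumptions: the left column is a rounded final score, and the final score is a fair stand-in for the score autolearning uses (it is not exactly the same, as noted earlier in the thread).

```python
# Score buckets and message counts copied from the list above.
distribution = {-15: 6, -4: 3364, -3: 4249, -2: 9982, -1: 4760,
                0: 13995, 1: 1267, 2: 789, 3: 387}


def share_at_or_below(threshold, buckets):
    """Fraction of messages whose score bucket is <= threshold."""
    total = sum(buckets.values())
    selected = sum(n for score, n in buckets.items() if score <= threshold)
    return selected / total


share = share_at_or_below(-2, distribution)
print("%.1f%% of listed messages score -2 or lower" % (share * 100))
```

That comes to 17,601 of the 38,799 listed messages, or roughly 45%.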

Bayes from that machine:

0.000  0  66285  0  non-token data: nspam
0.000  0  85888  0  non-token data: nham
0.000  0  1864402  0  non-token data: ntokens

As you see, because of the structure of the incoming mail, the ham exceeds 
the spam and the gap is probably steadily growing. This is also reflected 
in the rule hits. The no. 1 rule that hits is Bayes_00 (it hits 99.7% of 
all ham). Bayes_99 is only at around position 25, but with a 100% accuracy 
and the no. 1 rule hitting spam (hitting about 50% of spam).

On servers where I get in some spam trap email and let part of it flow 
thru the MTA rejection the picture is very different. For instance the 
server for my own domains has only 25% ham. Bayes_99 is the no. 1 hitting 
rule with an accuracy of 95.8% (again, not checked if the remainder really 
was ham). With all the URIBL rules and BAYES_00 (accuracy 99.9%) as 
runners up.

So, all in all Bayes works very much for me. Especially in those cases 
where no other rule hits (typically some spamvertized site not yet on a 
URIBL) it's most often the only rule that hits. That's why I moved it to 
5.0 a while ago. Works very well. I think if you use DCC or Razor you may 
get similar results for these rules and may not need to rely so much on 
Bayes. I do not use *any* network rules except for the URIBL stuff which 
isn't shut off by skip_rbl_checks 1.

(Figures are taken from mailwatch rule hits tables.)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com





Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread decoder

Marc Perkel wrote:


Justin Mason wrote:


So you're volunteering to code it up, then? ;)

--j.

  

I would if I were any good at perl.


I think we should evaluate if the suggested technique works and performs 
better or is at least of some benefit, before trying to implement it 
properly as a plugin. Such a test can be done offline with spam/ham 
easily... I started writing a script that mines some of my spam and ham, 
and then I'll evaluate how good the classifiers are that I get.



Cheers,


Chris




Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread LuKreme

On 4-Mar-2009, at 02:38, Matus UHLAR - fantomas wrote:

LuKreme wrote:

It sounds like a really good idea to me, and also like the most
reasonable way to manage self-learning meta rules.


On 03.03.09 16:43, Marc Perkel wrote:
It seems to me that the consensus is that it's worth a try. I don't know
if it will work or not but I think there's a good chance this could be a
significant advancement in how well SA works.


I should note that some policy rules and rules with manually updated scores
(SPF_PASS, BAYES_*) may need to be exempted from this.
We don't want SPF_PASS to generate high positive score, do we?


You're still thinking linearly. It's not a matter of giving SPF_PASS
a high score, it's a matter of looking at ham and noticing that
SPF_PASS and Bayes_00 and RANDTEST_08 have a ham index of 4% and a
spam index of 0.02% and modifying mail that hits those three tests with
a +2.0 to the overall score.


It seems to me that from a programming standpoint, not much needs to
be done. All we need is a second Bayes instance that ONLY looks at
the X-Spam-Status line and a new status line, X-Spam-Passed, which SA
gets patched to add as well. Let Bayes run over a large corpus of
mail looking at those two headers and see if it's useful.
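
An offline experiment along these lines could start from something like the sketch below. It is a toy, not SpamAssassin's Bayes code: it reads rule names from the tests= field of X-Spam-Status and treats each one as a token in a plain naive Bayes with add-one smoothing. The class and function names are invented, and the proposed X-Spam-Passed header is left out because it does not exist yet.

```python
import math
import re
from collections import Counter


def rules_from_status(header_value):
    """Extract rule names from the tests= field of an X-Spam-Status value."""
    match = re.search(r"tests=(.*?)(?:\s+\w+=|$)", header_value, re.S)
    if not match:
        return set()
    names = (name.strip() for name in match.group(1).split(","))
    return {name for name in names if name and name != "none"}


class RuleBayes:
    """Toy naive Bayes that uses rule names as tokens."""

    def __init__(self):
        self.hits = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, rules, label):
        self.messages[label] += 1
        self.hits[label].update(set(rules))

    def spam_probability(self, rules):
        # Smoothed prior odds, then one likelihood ratio per rule hit.
        log_odds = math.log(
            (self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for rule in set(rules):
            p_spam = (self.hits["spam"][rule] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.hits["ham"][rule] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

Note that naive Bayes scores each rule independently, so it only approximates the "combinations" discussed in this thread. Catching real interactions would need pairs of rule names as extra tokens, or a different learner.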


--
My mind is going. There is no question about it. I can feel it. I can
feel it. I can feel it. I'm... afraid.



Re: Dealing with low scoring spam - tighter MTA integration

2009-03-04 Thread John Hardin

On Wed, 4 Mar 2009, Andrzej Adam Filip wrote:


This would be an entirely different application, not SA, wouldn't it?


It can be developed using the same spam score logic, based on the subset of
all tests that requires only the subset of the final data available during a
classic run.


So in other words something like SMTP-time DNSBL tests that score points 
towards rejection rather than being pass/fail? That sounds like a good 
idea.



I do think that promoting tools that encourage postmaster to care very
much about mail server (IP address) reputation can make real difference
e.g. caring to be above reputation none in DNSWL to avoid grey-listing.


Agreed. But, performing major redesign of SA to achieve this pre-RCPT is 
going to be a tough sell.



Well, this probably could be done in SA using a multi-level protocol
capable of returning values at different stages. However, this seems
perfectly suited for a lightweight tool, rather than a hog that is
designed to scan and process entire messages. :)


During initial tests/deployment *much* simpler implementation can be
used with recommended action based on spam score:

It would require redesign of 50_scores.cf structure.
 e.g. instead of
   score RCVD_IN_DNSWL_HI 0 -8 0 -8
 something like that
   # N - Network, B - Bayes, nX - no X, R - RCPT TO:
   score RCVD_IN_DNSWL_HI nNnB=0 NnB=-8 nNB=0 NB=-8 R=-8
 or shorter
   score RCVD_IN_DNSWL_HI N=-8 R=-8


Why would SA be served by _major_ modifications like this, rather than 
writing a new milter that focuses on determining the reputation of an IP? 
Are you really willing to break _all_ existing SA installations for this?


Please don't try to make SA a do everything tool, you'll likely weaken 
what it does an outstanding job of today.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Failure to plan ahead on someone else's part does not constitute
  an emergency on my part. -- David W. Barts in a.s.r
---
 4 days until Daylight Saving Time begins in U.S. - Spring Forward


Re: Dealing with low scoring spam - tighter MTA integration

2009-03-04 Thread Andrzej Adam Filip
John Hardin jhar...@impsec.org wrote:

 On Wed, 4 Mar 2009, Andrzej Adam Filip wrote:

 This would be an entirely different application, not SA, wouldn't it?

 It can be developed using the same spam score logic, based on the subset of
 all tests that requires only the subset of the final data available during a
 classic run.

 So in other words something like SMTP-time DNSBL tests that score
 points towards rejection rather than being pass/fail? That sounds like
 a good idea.

 I do think that promoting tools that encourage postmaster to care very
 much about mail server (IP address) reputation can make real difference
 e.g. caring to be above reputation none in DNSWL to avoid grey-listing.

 Agreed. But, performing major redesign of SA to achieve this pre-RCPT
 is going to be a tough sell.

 Well, this probably could be done in SA using a multi-level protocol
 capable of returning values at different stages. However, this seems
 perfectly suited for a lightweight tool, rather than a hog that is
 designed to scan and process entire messages. :)

 During initial tests/deployment *much* simpler implementation can be
 used with recommended action based on spam score:

 It would require redesign of 50_scores.cf structure.
  e.g. instead of
    score RCVD_IN_DNSWL_HI 0 -8 0 -8
  something like that
    # N - Network, B - Bayes, nX - no X, R - RCPT TO:
    score RCVD_IN_DNSWL_HI nNnB=0 NnB=-8 nNB=0 NB=-8 R=-8
  or shorter
    score RCVD_IN_DNSWL_HI N=-8 R=-8

 Why would SA be served by _major_ modifications like this, rather than
 writing a new milter that focuses on determining the reputation of an
 IP? Are you really willing to break _all_ existing SA installations
 for this?

 Please don't try to make SA a do everything tool, you'll likely
 weaken what it does an outstanding job of today.

0) Such a _major_ modification means introducing it in the next _major_
   SpamAssassin release, unless it can be made backward compatible,
   e.g. by using a *separate* score file for the at-RCPT-TO tests.

   Where there's a will, there's a way.

1) I want milter(s) (MIMEDefang's filtering script in Perl) to use
   SpamAssassin in such a role. I personally prefer such tools from teams
   with a well-established maintenance reputation. I also believe that
   the SA score tuning methodology would fit very well too.
2) Anyway, limiting scores to *only* four cases *SHOULD NOT* stay forever.

-- 
[plen: Andrew] Andrzej Adam Filip : a...@onet.eu
All the people are so happy now, their heads are caving in.
I'm glad they are a snowman with protective rubber skin
  -- They Might Be Giants


Re: Bye Bye Bayes

2009-03-04 Thread LuKreme

On 4-Mar-2009, at 07:06, John Hardin wrote:

On Tue, 3 Mar 2009, LuKreme wrote:

On Mar 3, 2009, at 17:07, John Hardin jhar...@impsec.org wrote:

On Tue, 3 Mar 2009, LuKreme wrote:
 I am considering the following:
  Autolearn read mail in the inbox as ham
 Autolearn mail in .Junk and .SPAM as spam
  This is pretty easy with maildir.
How is that different from using the built-in autolearning based  
on message score?


How is it the same? Already read messages in inbox means the user  
has accepted those messages without trashing them or junking them.


Sorry, I didn't register that part. I thought it was just messages  
in the inbox.


Bear in mind some mail clients will mark a message read if you  
only highlight the title line. Auto-preview can be annoying that way  
sometimes.


Yep, and I think THAT has caused me to decide against doing this.
Instead I am now thinking about having it autolearn as ham any
messages that are read and are NOT in .Junk* .SPAM* /cur /new
or .Trash* -- but again, just mulling it over.
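
A rough sketch of that selection, for illustration only. It assumes a Maildir++ layout (subfolders are dot-prefixed directories next to the inbox's `cur/`) and the standard `:2,` flag suffix in which `S` means "seen". The names `is_seen` and `ham_candidates` are hypothetical.

```python
from pathlib import Path

# Folder prefixes excluded from ham learning, taken from the post above.
EXCLUDED_PREFIXES = (".Junk", ".SPAM", ".Trash")


def is_seen(filename):
    """True if a Maildir filename carries the S (seen) flag, e.g. 'id:2,RS'."""
    _, sep, flags = filename.rpartition(":2,")
    return bool(sep) and "S" in flags


def ham_candidates(maildir_root):
    """Yield read messages from the inbox and from non-excluded folders."""
    root = Path(maildir_root)
    folders = [root] + [
        p for p in sorted(root.iterdir())
        if p.is_dir() and p.name.startswith(".")
        and not p.name.startswith(EXCLUDED_PREFIXES)
    ]
    for folder in folders:
        cur = folder / "cur"  # messages in new/ have not been seen yet
        if cur.is_dir():
            for msg in sorted(cur.iterdir()):
                if msg.is_file() and is_seen(msg.name):
                    yield msg
```

The resulting paths could then be handed to `sa-learn --ham`. The caveat raised earlier still applies: a client with auto-preview may set the seen flag on mail the user never actually read.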


.Junk means the user, or the user's MUA, has flagged a message that  
is not tagged as spam.


Okay, I was assuming that was your SA spam quarantine, not your  
equivalent of the user's spam training folder.


I believe both the Mozilla email programs (Tbird, Netscrape, Postbox)
and Apple Mail.app use Junk for messages the MUAs think are
spammish; not sure about any other clients. Our SA spam quarantine
is .SPAM


False junk would get pulled out of .Junk into the inbox and  
relearned as ham.


Haven't done it, still mulling.


Now that you've explained it in more detail it sounds better.


Better, but not good, perhaps. I've half a mind to simply forget
auto-learning for the virtual users completely and make them use
sa-ham and sa-spam to manually train, and if they don't? Yeah, too bad.
OTOH, I'm a little tired and cranky today...


--
But just because you've seen me on your TV Doesn't mean I'm any
more enlightened than you



Please remove li...@billmerriam.com

2009-03-04 Thread Kai Schaetzl
Could the list-admin please remove li...@billmerriam.com from the list as 
it bounces all messages back to the original sender? Thanks!

(BTW: there's no list-admin address published anywhere.)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com





SpamAssassin Doesn't Appear to be working

2009-03-04 Thread JasonHirsh

I have a FreeBSD 7.0 RC3 server running Postfix, amavisd-new, clamavd and
SpamAssassin. Having just upgraded ports, I believe they are all current
releases.

In this setup I am led to believe that amavisd-new handles the SA config,
but I did not see a process for spamd, so I enabled it in rc.conf. I am not
seeing any X-Spam related headers in the long message header, but the GTUBE
test message was discarded, so it appears to be working.

my amavisd config for SA  is

$sa_tag_level_deflt  = 2.0;  # add spam info headers if at, or above that level
$sa_tag2_level_deflt = 6.31; # add 'spam detected' headers at that level
$sa_kill_level_deflt = 4.0;  # triggers spam evasive actions
$sa_dsn_cutoff_level = 10;   # spam level beyond which a DSN is not sent
# $sa_quarantine_cutoff_level = 20; # spam level beyond which quarantine is off
# $penpals_bonus_score = 5;  # (no effect without a @storage_sql_dsn database)
# $penpals_threshold_high = $sa_kill_level_deflt; # don't waste time on hi spam

$sa_mail_body_size_limit = 400*1024; # don't waste time on SA if mail is larger
$sa_local_tests_only = 0;    # only tests which do not require internet access?
$sa_spam_subject_tag = '***SPAM*** ';
# SpamAssassin settings


@bypass_banned_checks_maps = (
['.sanddollarbonaire.com','habitatbonaire.com',
'.constantcontact.com','.att.net'] );
# $sa_local_tests_only is passed to Mail::SpamAssassin::new as a value
# of the option local_tests_only. See Mail::SpamAssassin man page.
# If set to 1, no SA tests that require internet access will be performed.
#
$sa_local_tests_only = 0;    # only tests which do not require internet access?
#$sa_auto_whitelist = 1;     # turn on AWL in SA 2.63 or older (irrelevant
                             # for SA 3.0, its cf option is use_auto_whitelist)
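
As a reading aid, the sketch below shows which of the four levels above a given spam score would reach. It is built only from the comments in the config, not from the amavisd-new source. The function name and the result keys are hypothetical, and it ignores everything else amavisd takes into account (recipient locality, per-recipient settings, the configured final destiny).

```python
def spam_actions(score, tag=2.0, tag2=6.31, kill=4.0, dsn_cutoff=10.0):
    """Map a spam score to the levels it reaches, using the values above.

    The comparisons follow the config comments ("at, or above" for the tag
    levels, "beyond" for the DSN cutoff); they are an interpretation only.
    """
    return {
        "info_headers": score >= tag,            # $sa_tag_level_deflt
        "spam_detected_headers": score >= tag2,  # $sa_tag2_level_deflt
        "evasive_action": score >= kill,         # $sa_kill_level_deflt
        "dsn_suppressed": score > dsn_cutoff,    # $sa_dsn_cutoff_level
    }
```

With these values, a score of 5.0 already reaches the kill level (4.0) while still being below the tag2 level (6.31).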

Any thoughts as to what happened to the headers, and should I even care?


thanks

jason




-- 
View this message in context: 
http://www.nabble.com/SpamAssassin-Doesn%27t-Appear-to-be-working-tp22341459p22341459.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.



Re: SpamAssassin Doesn't Appear to be working

2009-03-04 Thread Mark Martinec
Jason,

 I have a FreeBSD 7.0 RC3 server running Postfix, amavisd-new, clamavd and
 SpamAssassin. Having just upgraded ports, I believe they are all current
 releases.

 In this setup I am led to believe that amavisd-new handles the SA config,
 but I did not see a process for spamd, so I enabled it in rc.conf.

There is no need for a spamd process in this setup - think of the amavisd
process as an equivalent of spamd (in that it calls the SpamAssassin library
of Perl modules), but one that speaks a different protocol: amavisd speaks
SMTP, spamd speaks the spamc/spamd protocol.

 I am not  seeing any x-spam related headers in the long message header
 but the GTUBE test message was discarded so it appears to be working
 any thoughts as to what happened to the headers and should i even care?

The most likely reason for absence of X-Spam-* header fields is that
the recipient was not considered local - check your setting of 
@local_domains_maps (or %local_domains or @local_domains_acl).
X-Spam-* header fields are not inserted for outbound mail (i.e. when
recipient is not considered local). Check the log (possibly at elevated
log level) to make sure.
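
To illustrate the kind of check involved, here is a simplified sketch of matching a recipient against a list of local domains. It assumes the leading-dot convention, in which `.example.com` covers the domain itself and its subdomains, and it leaves out the other lookup features amavisd offers. The function name and the example domains are hypothetical.

```python
def is_local_recipient(address, local_domains):
    """Return True if the recipient's domain matches an entry in the list.

    Assumed convention: an entry with a leading dot matches the domain
    itself and any subdomain; an entry without one matches exactly.
    """
    domain = address.rpartition("@")[2].lower()
    for entry in local_domains:
        entry = entry.lower()
        if entry.startswith("."):
            if domain == entry[1:] or domain.endswith(entry):
                return True
        elif domain == entry:
            return True
    return False
```

If the recipient's domain is not covered by the list, the mail is treated as outbound under this sketch, which is the case described above where no X-Spam-* header fields are added.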

  Mark


Re: More Google group messages

2009-03-04 Thread Private
Kai Schaetzl wrote:
 Albert E. Whale wrote on Sun, 01 Mar 2009 12:51:36 -0500:

   
 Our Email is Filtered using the SPAMZapper.  More than 50 Million
 connections filtered and still counting.  SPAMZapper stops Spam, Viruses,
 and other Malware before it reaches your PC.
 

 Doesn't this ultimate tool stop it?

 Kai

   
Thanks Kai!

Yes, it does a VERY Good job of stopping them, but like everything else,
it requires some maintenance, which is what we provide for our customers. 

Thank you!

Our Email is Filtered using the SPAMZapper.  More than 50 Million connections 
filtered and still counting.  SPAMZapper stops Spam, Viruses, and other Malware 
before it reaches your PC.


Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread decoder

decoder wrote:

Justin Mason wrote:


So you're volunteering to code it up, then? ;)


I was planning to do at least some brainstorming and experiments as to
which learning methods would seem suitable and how well each method
performs, whenever I have time again. Unless someone else did that
already?




OK, I did some short experiments: I built an SVM classifier from a
large mail corpus (8226 mails: 5414 ham, 2812 spam) and ran a 5-fold
cross-validation. The resulting classifier has an accuracy of over 99%,
so it performs as well as the regular system. I then applied it to a set
of 202 false negatives that I had collected, and 69 of these are recognized
as spam by the SVM. As a second test, I pulled 2707 mails from one of my
other inboxes and applied the classifier; the accuracy was again over
99% (and this is only ham).
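
For readers unfamiliar with the procedure, the sketch below shows how a 5-fold cross-validation partitions a corpus. It does not reproduce the SVM experiment itself: the classifier is left out, and the function name, the shuffling seed and the round-robin fold assignment are assumptions made for illustration.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every item lands in exactly one test fold; the other k-1 folds
    form the training set for that round.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(predicted, actual):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

With 8226 mails and k=5, each round trains on roughly 6580 mails and tests on the remaining 1645 or so. On the false-negative set, 69 of 202 is about 34%.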


From my point of view, the results show that this approach has 
potential. It is highly accurate with respect to the current system, but 
additionally outperformed it on several false negatives.



There are other advantages that this system has over the common system: 
It allows everybody to train the whole spamfilter (not only Bayes) to 
the kind of spam that one receives, i.e. it is more adaptive than the 
common system.



Any opinions on this are greatly welcome. Maybe we should try to come up 
with a proof of concept plugin for SA?



Best regards,


Chris




Re: Dealing with low scoring spam - tighter MTA integration [was: 2 + 2 != 4 - Spamassassin needs a new paradigm]

2009-03-04 Thread SM

At 07:02 04-03-2009, Andrzej Adam Filip wrote:

Maybe SpamAssassin should create a set of tests intended for use before
replying to RCPT TO: in an SMTP session?
[ tests based on: sending IP address, envelope sender, envelope
recipient, and name in HELO/EHLO ]


SpamAssassin processes the message and returns the result.  The way 
it is designed, it can be integrated in different environments as it 
is MTA agnostic.  The change you propose could be done by introducing 
a new command in the protocol to evaluate the envelope information only.


It would be easier to do all that through a milter as there is less 
overhead.  The downside is that you will get more false positives.


Regards,
-sm 



Re: Dealing with low scoring spam - tighter MTA integration [was: 2 + 2 != 4 - Spamassassin needs a new paradigm]

2009-03-04 Thread Kenneth Porter
--On Wednesday, March 04, 2009 4:02 PM +0100 Andrzej Adam Filip 
a...@onet.eu wrote:



Maybe SpamAssassin should create a set of tests intended for use before
replying to RCPT TO: in an SMTP session?


Check out http://mimedefang.org/

MIMEDefang includes SA integration.




Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

2009-03-04 Thread Marc Perkel



decoder wrote:

decoder wrote:

Justin Mason wrote:


So you're volunteering to code it up, then? ;)


I was planning to do at least some brainstorming and experiments as to
which learning methods would seem suitable and how well each method
performs, whenever I have time again. Unless someone else did that
already?




OK, I did some short experiments: I built an SVM classifier from a
large mail corpus (8226 mails: 5414 ham, 2812 spam) and ran a 5-fold
cross-validation. The resulting classifier has an accuracy of over
99%, so it performs as well as the regular system. I then applied it to
a set of 202 false negatives that I had collected, and 69 of these are
recognized as spam by the SVM. As a second test, I pulled 2707 mails
from one of my other inboxes and applied the classifier; the accuracy
was again over 99% (and this is only ham).


From my point of view, the results show that this approach has 
potential. It is highly accurate with respect to the current system, 
but additionally outperformed it on several false negatives.



There are other advantages that this system has over the common 
system: It allows everybody to train the whole spamfilter (not only 
Bayes) to the kind of spam that one receives, i.e. it is more adaptive 
than the common system.



Any opinions on this are greatly welcome. Maybe we should try to come 
up with a proof of concept plugin for SA?



Good work so far but sounds like you need to throw more data at it. Also 
even though you indicate over 99% accuracy can you break that down 
better? 99.9% is 10 times as accurate as 99%.
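
The arithmetic behind that remark can be spelled out as a small worked sketch. The helper name `expected_errors` is hypothetical, and the corpus size is simply the 8226 mails quoted above.

```python
from fractions import Fraction


def expected_errors(accuracy_percent, n_messages):
    """Expected number of misclassified messages at a given accuracy.

    accuracy_percent is passed as a string such as "99.9" so that the
    calculation stays exact.
    """
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return error_rate * n_messages


errors_99 = expected_errors("99", 8226)     # about 82 messages
errors_999 = expected_errors("99.9", 8226)  # about 8 messages
ratio = errors_99 / errors_999              # exactly 10
```

This is why "over 99%" alone says little: the error count differs by a factor of ten between the two ends of that range.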


Also - when it identifies messages do the numbers on the spam scores go 
up and ham goes down? If so that makes it more solid and starves the 
middle. I'm encouraged that the initial results are good.


My feeling is that if this works, it will work better if we have
more informational tokens. For example: is the From address a freemail
address? Does the message contain a freemail address? By themselves
these wouldn't score points. But spam coming from yahoo, hotmail, gmail,
etc. is a different kind of spam than spam coming from spambots. Maybe
country tokens from the Received lines would be useful. Maybe names of
banks in the message would be useful. For example, Bank of America +
Nigeria = spam.


I'm really glad you're junking on this. I think it will be a breakthrough.


Some of these tokens


Re: Dealing with low scoring spam - tighter MTA integration

2009-03-04 Thread Andrzej Adam Filip
Kenneth Porter sh...@sewingwitch.com wrote:

 --On Wednesday, March 04, 2009 4:02 PM +0100 Andrzej Adam Filip
 a...@onet.eu wrote:

 Maybe SpamAssassin should create a set of tests intended for use before
 replying to RCPT TO: in an SMTP session?

 Check out http://mimedefang.org/

 MIMEDefang includes SA integration.

I know MIMEDefang and I use it on one installation.

What I would like to see is an option to make SpamAssassin produce
weighted scores based on the subset of all tests that can work on the
subset of the final data available *before* the message headers and body
are transferred in the SMTP session.
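
A minimal sketch of that idea, assuming each rule declares which inputs it needs. This is not how SpamAssassin rules are defined; the rule names, scores and input labels below are invented purely for illustration.

```python
# Hypothetical rule table: (name, score, inputs the rule needs).
# Rule names and scores are invented for illustration only.
RULES = [
    ("EXAMPLE_CLIENT_IP_TRUSTED", -8.0, {"client_ip"}),
    ("EXAMPLE_HELO_IS_BARE_IP", 1.5, {"helo"}),
    ("EXAMPLE_SENDER_RCPT_MISMATCH", 0.5, {"mail_from", "rcpt_to"}),
    ("EXAMPLE_BODY_PHRASE", 2.0, {"body"}),
]

# Data known when the server has to answer RCPT TO:
RCPT_TO_STAGE = {"client_ip", "helo", "mail_from", "rcpt_to"}


def usable_rules(available, rules=RULES):
    """Rules whose required inputs are all available at this SMTP stage."""
    return [rule for rule in rules if rule[2] <= available]


def stage_score(hits, available, rules=RULES):
    """Sum the scores of rules that both fired and were usable at this stage."""
    return sum(score for name, score, _ in usable_rules(available, rules)
               if name in hits)
```

Called with `RCPT_TO_STAGE`, only the first three rules are considered; the body rule is skipped because the message content has not been transferred yet.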

-- 
[plen: Andrew] Andrzej Adam Filip : a...@onet.eu
Treaties are like roses and young girls -- they last while they last.
  -- Charles DeGaulle