Re: Auto-learning ‘considered harmful’: not so much when rejecting spam?

2023-01-17 Thread Matus UHLAR - fantomas

On 1/17/2023 7:33 AM, David Bürgin wrote:

I have heard it said many times on this list that auto-learning is
discouraged, so I decided to finally look into disabling it.

But then I realised that I do have a use for auto-learning: In my setup,
I use a milter to reject certain spam (score > 10.0). Now, if I turn off
auto-learning I lose something. Because, as far as I understand the
default spam auto-learning threshold of 12.0 causes incoming
high-probability spam to be learned as spam, even though the message is
then rejected and not available locally later.

Is my understanding correct? Auto-learning of spam can be useful if spam
is rejected during the SMTP conversation but after it has been seen
– and learned – by SpamAssassin?


On 17.01.23 09:37, Kevin A. McGrail wrote:
The problem with auto learning I've seen is that it slowly spirals 
miscategorization errors.


mostly because there are no really useful indicators of hamminess, and if 
there are, spammers use them to spread their junk.


after long manual training being occasionally spoiled by autolearn, 
I have manually set all rules that have negative scores to noautolearn:


tflags  RCVD_IN_RP_CERTIFIED        noautolearn net nice
tflags  RCVD_IN_VALIDITY_CERTIFIED  noautolearn net nice
tflags  RCVD_IN_RP_SAFE             noautolearn net nice
tflags  RCVD_IN_VALIDITY_SAFE       noautolearn net nice
tflags  RCVD_IN_DNSWL_LOW           noautolearn net nice
tflags  RCVD_IN_DNSWL_MED           noautolearn net nice
tflags  RCVD_IN_DNSWL_HI            noautolearn net nice
tflags  RCVD_IN_MSPIKE_H2           noautolearn net nice
tflags  RCVD_IN_MSPIKE_H3           noautolearn net nice
tflags  RCVD_IN_MSPIKE_H4           noautolearn net nice
tflags  RCVD_IN_MSPIKE_H5           noautolearn net nice
tflags  RCVD_IN_MSPIKE_WL           noautolearn net nice
tflags  RCVD_IN_IADB_DK             noautolearn net nice
tflags  RCVD_IN_IADB_DOPTIN         noautolearn net nice
tflags  RCVD_IN_IADB_LISTED         noautolearn net nice
tflags  RCVD_IN_IADB_MI_CPR_MAT     noautolearn net nice
tflags  RCVD_IN_IADB_ML_DOPTIN      noautolearn net nice
tflags  RCVD_IN_IADB_OPTIN          noautolearn net nice
tflags  RCVD_IN_IADB_OPTIN_GT50     noautolearn net nice
tflags  RCVD_IN_IADB_RDNS           noautolearn net nice
tflags  RCVD_IN_IADB_SENDERID       noautolearn net nice
tflags  RCVD_IN_IADB_SPF            noautolearn net nice
tflags  RCVD_IN_IADB_UT_CPR_MAT     noautolearn net nice
tflags  RCVD_IN_IADB_VOUCHED        noautolearn net nice
tflags  DKIMWL_WL_HIGH              noautolearn net nice
tflags  DKIMWL_WL_MEDHI             noautolearn net nice
tflags  DKIMWL_WL_MED               noautolearn net nice
tflags  DKIM_VALID                  noautolearn net nice
tflags  DKIM_VALID_EF               noautolearn net nice

still needs some training.

and, in some places, you may need to dump the database and re-train from 
scratch.
That's why manual training is great and why you need to keep some spam, but 
mostly ham.



The technical term is that it reinforces a 
bias.  A Bayes database should be carefully maintained.  It's not very 
much of a fire and forget technology.


And, for example, letting users control it becomes a question of 
"what is spam?"  For example, users might get a very legit mail BUT 
they are tired of seeing it in their inbox.  So they want to train it 
as spam.  If you have per-user implementations, that can be good BUT 
you need a few hundred samples of good email and bad email to activate 
Bayes.


In short, I don't have a good solution for training Bayes that isn't a 
lot of work but auto-learning is usually a bad solution.


One case where it might be good is if you had a system setup that you 
fed emails to that were classified.  It would then use that good feed 
to use the auto-learning and add a way of learning without using the 
command line.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
It's now safe to throw off your computer.


Re: Auto-learning ‘considered harmful’: not so much when rejecting spam?

2023-01-17 Thread Kevin A. McGrail

On 1/17/2023 7:33 AM, David Bürgin wrote:

I have heard it said many times on this list that auto-learning is
discouraged, so I decided to finally look into disabling it.

But then I realised that I do have a use for auto-learning: In my setup,
I use a milter to reject certain spam (score > 10.0). Now, if I turn off
auto-learning I lose something. Because, as far as I understand the
default spam auto-learning threshold of 12.0 causes incoming
high-probability spam to be learned as spam, even though the message is
then rejected and not available locally later.

Is my understanding correct? Auto-learning of spam can be useful if spam
is rejected during the SMTP conversation but after it has been seen
– and learned – by SpamAssassin?


The problem with auto learning I've seen is that it slowly spirals 
miscategorization errors.  The technical term is that it reinforces a 
bias.  A Bayes database should be carefully maintained.  It's not very 
much of a fire and forget technology.


And, for example, letting users control it becomes a question of "what 
is spam?"  For example, users might get a very legit mail BUT they are 
tired of seeing it in their inbox.  So they want to train it as spam.  
If you have per-user implementations, that can be good BUT you need a 
few hundred samples of good email and bad email to activate Bayes.


In short, I don't have a good solution for training Bayes that isn't a 
lot of work but auto-learning is usually a bad solution.


One case where it might be good is if you had a system setup that you 
fed emails to that were classified.  It would then use that good feed to 
use the auto-learning and add a way of learning without using the 
command line.


Regards,
KAM

--
Kevin A. McGrail
kmcgr...@apache.org

Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171



Auto-learning ‘considered harmful’: not so much when rejecting spam?

2023-01-17 Thread David Bürgin
I have heard it said many times on this list that auto-learning is
discouraged, so I decided to finally look into disabling it.

But then I realised that I do have a use for auto-learning: In my setup,
I use a milter to reject certain spam (score > 10.0). Now, if I turn off
auto-learning I lose something. Because, as far as I understand the
default spam auto-learning threshold of 12.0 causes incoming
high-probability spam to be learned as spam, even though the message is
then rejected and not available locally later.

Is my understanding correct? Auto-learning of spam can be useful if spam
is rejected during the SMTP conversation but after it has been seen
– and learned – by SpamAssassin?


Re: Question regarding auto-learning

2018-07-04 Thread Matus UHLAR - fantomas

On 03.07.18 12:17, J Doe wrote:

From reading the documentation, it appears that to train the Bayesian
filter I require a minimum of 1,000 pieces of ham and 1,000 pieces of
spam.


no. You need at least 200 hams and 200 spams for Bayes to start firing, but you
can tune that by setting bayes_min_ham_num and bayes_min_spam_num.

note that too few mails trained can result in false positives/negatives.
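
The activation rule described above can be modelled in a few lines. This is only an illustrative sketch, not SpamAssassin's own code: the function name and the constants are made up for the example, and the value 200 is the minimum quoted in this reply.

```python
# Minimums quoted in the thread: 200 ham and 200 spam messages.
DEFAULT_MIN_HAM = 200   # stands in for bayes_min_ham_num
DEFAULT_MIN_SPAM = 200  # stands in for bayes_min_spam_num


def bayes_is_active(nham, nspam, min_ham=DEFAULT_MIN_HAM, min_spam=DEFAULT_MIN_SPAM):
    """Return True once both learned-message counts reach their minimums."""
    return nham >= min_ham and nspam >= min_spam


# 250 ham but only 150 spam learned: one count is still short.
print(bayes_is_active(nham=250, nspam=150))
# Same counts with the spam minimum lowered to 100.
print(bayes_is_active(nham=250, nspam=150, min_spam=100))
```

Lowering the minimums makes Bayes start scoring sooner, at the cost noted above: a database trained on very few messages is more likely to misclassify.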


I am currently collecting spam on one of my servers via a spam trap
address and slowly reaching that number.  I was wondering, though, if I
can use auto learning (bayes_auto_learn 1), before training the database ?


autolearning does the training for you. manual training is still faster
and more precise.


When autolearn fires on messages at the moment, it is correctly detecting
ham and spam based on the default ham and spam thresholds:

   bayes_auto_learn_threshold_nonspam 0.1
   bayes_auto_learn_threshold_spam 12.0

Can this be used before training the database or is it more often used to
supplement (on an ongoing basis), a database that has already been trained?


those don't contradict each other.
you can use manual and automatic learning both.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Chernobyl was an Windows 95 beta test site.


Question regarding auto-learning

2018-07-03 Thread J Doe
Hello,

I have a question regarding autolearning and Bayes functionality.

From reading the documentation, it appears that to train the Bayesian filter I 
require a minimum of 1,000 pieces of ham and 1,000 pieces of spam.  I am 
currently collecting spam on one of my servers via a spam trap address and 
slowly reaching that number.  I was wondering, though, if I can use auto 
learning (bayes_auto_learn 1), before training the database ?

When autolearn fires on messages at the moment, it is correctly detecting ham 
and spam based on the default ham and spam thresholds:

bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam 12.0

Can this be used before training the database or is it more often used to 
supplement (on an ongoing basis), a database that has already been trained?

Thanks,

- J




Re: Bayes not auto-learning?

2018-02-24 Thread David Jones

On 02/24/2018 01:05 AM, Amir Caspi wrote:

On Feb 23, 2018, at 11:47 PM, David B Funk  wrote:

It could have 20 points from a whole bunch of body rules but if it only hit 2
points via header rules it still will not auto-learn.


Gotcha. The spam in question that triggered this hit a lot of rules, but hard 
for me to tell on cursory inspection whether it satisfies sufficient header and 
body points.  But it LOOKS like there should be at least 3 points from header 
(MISSING_HEADERS, FREEMAIL_FORGED_REPLYTO, among others) and certainly 3 body 
(MONEY_FRAUD_3 at the very least).  The actual spam report is this:

*  0.0 FSL_CTYPE_WIN1251 Content-Type only seen in 419 spam
*  0.0 NSL_RCVD_FROM_USER Received from User
*  1.0 MISSING_HEADERS Missing To: header
*  0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
*  [score: 0.5004]
*  1.1 DCC_CHECK Detected as bulk mail by DCC (dcc-servers.net)
*  0.0 FROM_MISSP_MSFT From misspaced + supposed Microsoft tool
*  0.0 FSL_NEW_HELO_USER Spam's using Helo and User
*  2.6 MSOE_MID_WRONG_CASE No description available.
*  0.0 FROM_MISSP_USER From misspaced, from "User"
*  1.0 RDNS_DYNAMIC Delivered to internal network by host with
*  dynamic-looking rDNS
*  0.0 LOTS_OF_MONEY Huge... sums of money
*  0.0 FROM_MISSP_XPRIO Misspaced FROM + X-Priority
*  1.6 REPLYTO_WITHOUT_TO_CC No description available.
*  0.0 AXB_XMAILER_MIMEOLE_OL_024C2 Yet another X header trait
*  0.0 MSGID_FROM_MTA_HEADER Message-Id was added by a relay
*  0.0 FSL_BULK_SIG Bulk signature with no Unsubscribe
*  2.1 FREEMAIL_FORGED_REPLYTO Freemail in Reply-To, but not From
*  1.0 FREEMAIL_REPLYTO Reply-To/From or Reply-To/body contain different
*  freemails
*  0.0 TO_NO_BRKTS_FROM_MSSP Multiple header formatting problems
*  1.9 FORGED_MUA_OUTLOOK Forged mail pretending to be from MS Outlook
*  1.6 TO_NO_BRKTS_DYNIP To: lacks brackets and dynamic rDNS
*  0.0 FILL_THIS_FORM Fill in a form with personal information
*  2.0 TO_NO_BRKTS_MSFT To: lacks brackets and supposed Microsoft tool
*  2.0 FILL_THIS_FORM_LONG Fill in a form with personal information
*  3.1 FROM_MISSP_FREEMAIL From misspaced + freemail provider
*  3.0 MONEY_FRAUD_3 Lots of money and several fraud phrases

But, it still didn't autolearn.

(I can post the entire spample if the above seems like it should have 
autolearned.)


Another possible factor, if you have "bayes_auto_learn_on_error" enabled, then 
autolearn will be skipped if Bayes already agrees with the condition of the message, i.e. 
if the message is already classified as BAYES_99 then it won't bother auto-learning it as 
yet another high-ranking spam.


I do not have that enabled.  Also, as you can see from above, this hit BAYES_50.

Does the above provide an indication as to why it didn't autolearn?

Thanks!

--- Amir




I found the best thing to do is setup a hidden mail server (iRedMail) 
and split a copy of all mail to it to sort and filter into a Ham and 
Spam folder based on rule hits and scoring.  Then I run a nightly 
sa-learn on the Ham and Spam folders (in that order).  The few 
questionable emails that score in the middle stay in the Inbox so I just 
have to drag-n-drop into the Ham or Spam folder taking a few minutes a 
day.  Some that are new phishing campaigns or are from compromised 
accounts are copied into a Spamcop folder that automatically submits it 
to my Spamcop account.
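
A nightly job along the lines described above might look like the sketch below. The folder paths and function names are hypothetical; the only parts taken from the real tool are the `sa-learn --ham` and `sa-learn --spam` invocations, run in that order as in the description.

```python
import subprocess

# Hypothetical Maildir locations; adjust to wherever the sorted copies live.
HAM_DIR = "/var/vmail/sorter/Maildir/.Ham/cur"
SPAM_DIR = "/var/vmail/sorter/Maildir/.Spam/cur"


def build_learn_commands(ham_dir=HAM_DIR, spam_dir=SPAM_DIR):
    """Return the sa-learn invocations, ham first and spam second."""
    return [
        ["sa-learn", "--ham", ham_dir],
        ["sa-learn", "--spam", spam_dir],
    ]


def run_nightly_training(dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for cmd in build_learn_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default so the sketch is safe to try without sa-learn installed.
    run_nightly_training(dry_run=True)
```

Run from cron with `dry_run=False` once the paths are right; `check=True` makes a failing sa-learn run raise an error, so the failure shows up in cron's mail rather than passing silently.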


I also use the Ham and Spam folders for the nightly SA masscheck to help 
get new rules validated and new 72_scores.cf update daily via sa-update.


--
David Jones


Re: Bayes not auto-learning?

2018-02-24 Thread Kevin A. McGrail

On 2/24/2018 2:05 AM, Amir Caspi wrote:

Does the above provide an indication as to why it didn't autolearn?


No, the above does not help as the autolearning is complicated. I 
believe a few years ago I added debug output or headers or something 
that tried to make it clearer.  If it doesn't autolearn, I would not 
stress.  It's not a simplistic, black or white decision based on a 
single factor.


Off-hand, I can't find the work I did but 
$status->get_autolearn_points() might help you dig into the code.


Regards,

KAM



Re: Bayes not auto-learning?

2018-02-23 Thread Amir Caspi
On Feb 23, 2018, at 11:47 PM, David B Funk  wrote:
> It could have 20 points from a whole bunch of body rules but if it only hit 2
> points via header rules it still will not auto-learn.

Gotcha. The spam in question that triggered this hit a lot of rules, but hard 
for me to tell on cursory inspection whether it satisfies sufficient header and 
body points.  But it LOOKS like there should be at least 3 points from header 
(MISSING_HEADERS, FREEMAIL_FORGED_REPLYTO, among others) and certainly 3 body 
(MONEY_FRAUD_3 at the very least).  The actual spam report is this:

*  0.0 FSL_CTYPE_WIN1251 Content-Type only seen in 419 spam
*  0.0 NSL_RCVD_FROM_USER Received from User
*  1.0 MISSING_HEADERS Missing To: header
*  0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
*  [score: 0.5004]
*  1.1 DCC_CHECK Detected as bulk mail by DCC (dcc-servers.net)
*  0.0 FROM_MISSP_MSFT From misspaced + supposed Microsoft tool
*  0.0 FSL_NEW_HELO_USER Spam's using Helo and User
*  2.6 MSOE_MID_WRONG_CASE No description available.
*  0.0 FROM_MISSP_USER From misspaced, from "User"
*  1.0 RDNS_DYNAMIC Delivered to internal network by host with
*  dynamic-looking rDNS
*  0.0 LOTS_OF_MONEY Huge... sums of money
*  0.0 FROM_MISSP_XPRIO Misspaced FROM + X-Priority
*  1.6 REPLYTO_WITHOUT_TO_CC No description available.
*  0.0 AXB_XMAILER_MIMEOLE_OL_024C2 Yet another X header trait
*  0.0 MSGID_FROM_MTA_HEADER Message-Id was added by a relay
*  0.0 FSL_BULK_SIG Bulk signature with no Unsubscribe
*  2.1 FREEMAIL_FORGED_REPLYTO Freemail in Reply-To, but not From
*  1.0 FREEMAIL_REPLYTO Reply-To/From or Reply-To/body contain different
*  freemails
*  0.0 TO_NO_BRKTS_FROM_MSSP Multiple header formatting problems
*  1.9 FORGED_MUA_OUTLOOK Forged mail pretending to be from MS Outlook
*  1.6 TO_NO_BRKTS_DYNIP To: lacks brackets and dynamic rDNS
*  0.0 FILL_THIS_FORM Fill in a form with personal information
*  2.0 TO_NO_BRKTS_MSFT To: lacks brackets and supposed Microsoft tool
*  2.0 FILL_THIS_FORM_LONG Fill in a form with personal information
*  3.1 FROM_MISSP_FREEMAIL From misspaced + freemail provider
*  3.0 MONEY_FRAUD_3 Lots of money and several fraud phrases

But, it still didn't autolearn.

(I can post the entire spample if the above seems like it should have 
autolearned.)

> Another possible factor, if you have "bayes_auto_learn_on_error" enabled, 
> then autolearn will be skipped if Bayes already agrees with the condition of 
> the message, i.e. if the message is already classified as BAYES_99 then it 
> won't bother auto-learning it as yet another high-ranking spam.

I do not have that enabled.  Also, as you can see from above, this hit BAYES_50.

Does the above provide an indication as to why it didn't autolearn?

Thanks!

--- Amir




Re: Bayes not auto-learning?

2018-02-23 Thread Ian Zimmerman
On 2018-02-23 22:32, Amir Caspi wrote:

> So, I've been trying to tweak my setup and noticed that VERY few of my
> emails are being autolearned as spam, even when their spam score
> is far above the autolearn threshold.  The threshold is set to 12; I
> just saw a spam with score >25 not being autolearned.

Sigh.  This really is a FAQ, and I did ask it myself (maybe more than
once).

Read the fine documentation.  In short: the score that is compared to
the threshold for autolearning is _not_ the normal score that determines
spam/ham.

Despite the fact that it is documented, I find the algorithm to be too
opaque to feel in control.

-- 
Please don't Cc: me privately on mailing lists and Usenet,
if you also post the followup to the list or newsgroup.
To reply privately _only_ on Usenet and on broken lists
which rewrite From, fetch the TXT record for no-use.mooo.com.


Re: Bayes not auto-learning?

2018-02-23 Thread David B Funk

On Fri, 23 Feb 2018, Amir Caspi wrote:


Hi all,

So, I've been trying to tweak my setup and noticed that VERY few of my 
emails are being autolearned as spam, even when their spam score is far above 
the autolearn threshold.  The threshold is set to 12; I just saw a spam with score 
>25 not being autolearned.

Are there rules that prevent autolearning?  If so, why?  If a spam 
scores really high because it hits (let's say) 10 or more rules, but just one 
of those rules is enough to prevent autolearning, that seems overly 
restrictive, no?

For example, for one of my users, out of about 650 spams received in 
the last month, only 10 have been autolearned.  For another user, only 12 of 
nearly 1400.  That seems like a very low percentage, and clearly some 
high-scoring spams are not being auto-learned.

Any explanation is appreciated!

Thanks!

--- Amir


If you read the spamassassin documentation about Bayes auto-learning you will 
see that there are several conditions that must be satisfied.


For example, there are some types of rules which aren't considered at all when 
computing the auto-learning threshold score (such as white/black list scores or 
rules tagged with the noautolearn tflag or the actual Bayes score itself).


Of the types of rules which are allowed, at least 3 of those points must come 
from header type rules and at least 3 of those points must come from body type 
rules.


So a spam can have 100 points from a blacklist and not auto-learn.

It could have 20 points from a whole bunch of body rules but if it only hit 2
points via header rules it still will not auto-learn.

Another possible factor, if you have "bayes_auto_learn_on_error" enabled, then 
autolearn will be skipped if Bayes already agrees with the condition of the 
message, i.e. if the message is already classified as BAYES_99 then it won't 
bother auto-learning it as yet another high-ranking spam.
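
The conditions listed above can be put together as a rough model. This is a simplified sketch based only on the description in this message, not the actual SpamAssassin implementation: the `RuleHit` type, the two-way header/body split, and the rule names in the examples are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RuleHit:
    name: str
    score: float
    kind: str               # "header" or "body" (simplified two-way split)
    excluded: bool = False  # list rules, noautolearn rules, BAYES_* rules


def would_autolearn_spam(hits, threshold_spam=12.0, min_header=3.0,
                         min_body=3.0, learn_on_error=False):
    """Apply the conditions described in the thread to a list of rule hits."""
    counted = [h for h in hits if not h.excluded]
    total = sum(h.score for h in counted)
    header_points = sum(h.score for h in counted if h.kind == "header")
    body_points = sum(h.score for h in counted if h.kind == "body")

    if total < threshold_spam:
        return False
    if header_points < min_header or body_points < min_body:
        return False
    # With learn-on-error, skip mail that Bayes already rates as BAYES_99.
    if learn_on_error and any(h.name == "BAYES_99" for h in hits):
        return False
    return True


# 100 points from a list rule: excluded, so nothing counts toward the threshold.
print(would_autolearn_spam([RuleHit("LIST_HIT", 100.0, "header", excluded=True)]))
# 20 body points but only 2 header points: the header minimum is not met.
print(would_autolearn_spam([RuleHit("BODY_A", 20.0, "body"),
                            RuleHit("HDR_A", 2.0, "header")]))
```

Both examples return False, matching the two cases given above. It also shows why a message with a final score above 25 can still be skipped: only the counted rules contribute, and each of the header and body minimums has to be met separately.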


What I usually see in auto-learned spam is that they hit a number of network RBL 
rules (spamhaus, SORBS, etc) and a number of body rules such as RAZOR, URIBLS, 
etc.



--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{


Bayes not auto-learning?

2018-02-23 Thread Amir Caspi
Hi all,

So, I've been trying to tweak my setup and noticed that VERY few of my 
emails are being autolearned as spam, even when their spam score is far 
above the autolearn threshold.  The threshold is set to 12; I just saw a spam 
with score >25 not being autolearned.

Are there rules that prevent autolearning?  If so, why?  If a spam 
scores really high because it hits (let's say) 10 or more rules, but just one 
of those rules is enough to prevent autolearning, that seems overly 
restrictive, no?

For example, for one of my users, out of about 650 spams received in 
the last month, only 10 have been autolearned.  For another user, only 12 of 
nearly 1400.  That seems like a very low percentage, and clearly some 
high-scoring spams are not being auto-learned.

Any explanation is appreciated!

Thanks!

--- Amir



Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 11:12 AM, John Hardin wrote:


A week or so back they briefly listed some of the MailControl.com MTAs,
due to apparent exploits. They were quickly removed, though.


So the message here is that some DNSBL's are better than others about 
including and removing addresses quickly and responsibly. Perhaps. I 
take no position on that.


But that does not address the issue of collateral damage to users which 
share an ISP's email server with someone else who happened to get a spam 
through and reported back to the DNSBL.


Not long ago, I had another client blocked from sending response emails 
to their on-line customers about their purchases. Turned out one of the 
users on the hosting provider's system had sent some spam. Now the 
hosting provider (Webfaction) is quite responsible, very diligent, and 
has *fantastic* support. (I can recommend them for dynamic language 
apps with no reservations.) But guess what? The DNSBL's 
interface for interacting with them was down. For over a week. (We're 
sorry, but... Please come back when... No guaranty as to...) And emails 
to the affected customers were blocked for all that time.


I use DNSBL's. But I don't like them. SA is indispensable. I like it. 
But it's a huge compilation of kluges that happen to mostly work.


Expedient. Pragmatic. Not a real solution to the actual problem.

-Steve



Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 11:10 AM, Jim Popovitch wrote:


Just a heads-up... that sort of biting comment is probably not welcome


I'm familiar with adapting to the relative insularities of various 
lists. But thanks for the heads-up, Jim.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread John Hardin

On Wed, 2 Jul 2014, Axb wrote:

If a sender's IP is listed @Spamhaus , he has a serious problem reaching 
many, many destinations. If he's been exploited, you get good evidence and 
fast delisting processing and I have yet to see a real FP with ZEN.


A week or so back they briefly listed some of the MailControl.com MTAs, 
due to apparent exploits. They were quickly removed, though.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  There is no better measure of the unthinking contempt of the
  environmentalist movement for civilization than their call to
  turn off the lights and sit in the dark.        -- Sultan Knish
---
 2 days until the 238th anniversary of the Declaration of Independence


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Jim Popovitch
On Wed, Jul 2, 2014 at 11:54 AM, Steve Bergman  wrote:
>> I suggest you join the SDLU list where you can discuss anti spam
>> philosophy.
>>
>
> Thanks. I suggest that you consult for an ISP-dependent business someday.
> ;-)
>
> It's an education, too.
>
> -Steve


Just a heads-up... that sort of biting comment is probably not welcome
on the SDLU list.

-Jim P.


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman

I suggest you join the SDLU list where you can discuss anti spam
philosophy.



Thanks. I suggest that you consult for an ISP-dependent business 
someday. ;-)


It's an education, too.

-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 05:39 PM, Steve Bergman wrote:

On 07/02/2014 09:48 AM, Axb wrote:


If an IP is exploited/sends spam and a legitimate msg is rejected then
somebody hasn't done due diligence and I see the reject as legitimate.



The legitimate senders and receivers of the good message, neither of
whose companies have anything to do with the spam, would not see it
that way. And I agree with their perspective. Some of the perspectives
I'm reading here seem really off in the ether. I get the impression that
some are so frustrated with SA's limitations that they are willing to
resort to desperate measures which normal users would instantly
recognize as insane.

No rudeness intended. But some of the things I'm reading here are just
bizarre.


I suggest you join the SDLU list where you can discuss anti spam 
philosophy.


It's a great resource for knowledge.

List Guidelines: http://www.new-spam-l.com/admin/faq.html
List Information: https://spammers.dontlike.us/mailman/listinfo/list

The Mailop list is also a good place to lurk and bathe in hundreds of 
years of mail related experience


http://chilli.nosignal.org/mailman/listinfo/mailop





Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman

On 07/02/2014 09:48 AM, Axb wrote:


If an IP is exploited/sends spam and a legitimate msg is rejected then
somebody hasn't done due diligence and I see the reject as legitimate.



The legitimate senders and receivers of the good message, neither of 
whose companies have anything to do with the spam, would not see it 
that way. And I agree with their perspective. Some of the perspectives 
I'm reading here seem really off in the ether. I get the impression that 
some are so frustrated with SA's limitations that they are willing to 
resort to desperate measures which normal users would instantly 
recognize as insane.


No rudeness intended. But some of the things I'm reading here are just 
bizarre.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 04:40 PM, Steve Bergman wrote:


You are discussing DNSBLs but not being specific.



I'm specific in that all the DNSBL's blacklist IP addresses or blocks.
And that in today's world many, many companies share sets of mail
servers with many other companies and individuals.


If an IP is exploited/sends spam and a legitimate msg is rejected then 
somebody hasn't done due diligence and I see the reject as legitimate.


If I need to open up, I have options as the DNSWL, etc.








Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman




You are discussing DNSBLs but not being specific.



I'm specific in that all the DNSBL's blacklist IP addresses or blocks. 
And that in today's world many, many companies share sets of mail 
servers with many other companies and individuals.



I'll let others sell you this Hoover.


No sale necessary. I continue to recognize the overall expediency of the 
DNSBL kluge, and continue to use it myself.


I wouldn't buy a Hoover anyway. I'm a Kirby kind of guy. I have a 1969 
Dual Sanitronic 80 that my grandmother gave our family new, as a 
Christmas gift.


https://c1.staticflickr.com/7/6071/6056367963_f06f08c7f6_z.jpg

A 1976 Classic III that I picked up at a garage sale.

http://cdn3.volusion.com/maxg3.xen6j/v/vspfiles/photos/KirbyClassicIII-4.jpg?1329982229

And a really cool model 516, manufactured in 1956 that someone had set 
out on the curb for garbage pickup, which I rescued and restored.


http://www.1377731.com/kirby/516_5.jpg

All stock photos. Not mine.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 03:54 PM, Steve Bergman wrote:



On 07/02/2014 06:45 AM, Axb wrote:


I'm pretty sure, a huge amount of SA users trust Spamhaus' ZEN at smtp
level for outright rejects.


At this point, I'm using the defaults, other than upping BAYES_999
enough to total 5.0 when added to BAYES_99.



If a sender's IP is listed @Spamhaus , he has a serious problem reaching
many, many destinations.


Many, many destinations? Or a high percentage of destinations? I
recently had to explain to the owner of the company why an important
email from one of his business associates at another company was
blocked. I told him that they were on a couple of spam block lists
(which they were) and that contributed to the mail's rejection.

I made the same pitch. "This should affect their outgoing mail to many
sites, etc.". But I'm not sure I believe it. When I interact with people
who've had their emails rejected (often related to DNSBLs) I've been
listening for any mention of other mails of theirs to other companies
being blocked. But when the DNSBL rules in SA are the major contributors
to the rejecting, it seems that we are the only domain they interact
with which is doing so. Entries in the DNSBL databases do great
collateral damage.

And of course none of these companies are spammers. They're with this or
that ISP who has, at one time, had someone exploit their servers to send
spam.

DNSBL's are like a guy with a bazooka trying to play sniper.



You are discussing DNSBLs but not being specific.

With millions of sessions/day I'm glad Spamhaus keeps my servers from 
melting.


I'll let others sell you this Hoover.







Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 06:45 AM, Axb wrote:


I'm pretty sure, a huge amount of SA users trust Spamhaus' ZEN at smtp
level for outright rejects.


At this point, I'm using the defaults, other than upping BAYES_999 
enough to total 5.0 when added to BAYES_99.




If a sender's IP is listed @Spamhaus , he has a serious problem reaching
many, many destinations.


Many, many destinations? Or a high percentage of destinations? I 
recently had to explain to the owner of the company why an important 
email from one of his business associates at another company was 
blocked. I told him that they were on a couple of spam block lists 
(which they were) and that contributed to the mail's rejection.


I made the same pitch. "This should affect their outgoing mail to many 
sites, etc.". But I'm not sure I believe it. When I interact with people 
who've had their emails rejected (often related to DNSBLs) I've been 
listening for any mention of other mails of theirs to other companies 
being blocked. But when the DNSBL rules in SA are the major contributors 
to the rejecting, it seems that we are the only domain they interact 
with which is doing so. Entries in the DNSBL databases do great 
collateral damage.


And of course none of these companies are spammers. They're with this or 
that ISP who has, at one time, had someone exploit their servers to send 
spam.


DNSBL's are like a guy with a bazooka trying to play sniper.

-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 10:47 AM, Steve Bergman wrote:


The DNSBL's are problematic because so many ISP's mail servers are on
them. We get quite a few emails from employees at companies whose ISPs
are on Spamhaus lists, or whatever, due to nothing that has anything to
do with them.


I'm pretty sure, a huge amount of SA users trust Spamhaus' ZEN at smtp 
level for outright rejects.


If a sender's IP is listed @Spamhaus , he has a serious problem reaching 
many, many destinations. If he's been exploited, you get good evidence 
and fast delisting processing and I have yet to see a real FP with ZEN.


Consider it being better a sender gets a hard reject than having msgs 
land in some spam folder and remain unseen.


but then...


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 10:47 AM, Steve Bergman wrote:


But for all the discussion today, we never really had a good talk about
postscreen, which is something I'd like to hear someone expound a bit upon.


probably Wrong list ... review Postfix list archives


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 10:47 AM, Steve Bergman wrote:


I'll add you to the list of people telling me that jumping out of an
airplane at 20,000 feet with nothing but a parachute and a pair of
underwear is fun.


Yep... it is...
though you could catch a cold...


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 03:05 AM, Dave Funk wrote:


Unless you've explicitly disabled them, the network based rules (razor,
pyzor, dcc, DNS based rules, RBLs, URIBLs, etc) constitute an external
'reputation' system to pass judgment on messages.


Actually, DCC is not included in the default due to arbitrary 
restrictions on request volume for the public servers. 100,000 per day 
or something. And neither is Pyzor, presumably for similar reasons? 
Razor2 is in by default.


I use all these, but have reservations about them. DCC, Pyzor, and Razor2 
are lists of bulk email. Not specifically of *unsolicited* bulk email. 
Many of my users are on lists of various sorts.


The DNSBL's are problematic because so many ISP's mail servers are on 
them. We get quite a few emails from employees at companies whose ISPs 
are on Spamhaus lists, or whatever, due to nothing that has anything to 
do with them.




It's not uncommon to take a low-scoring spam and find that it gets a
higher score on retest as it has been added to various bad-boy lists.


Except that the "bad-boy" lists flag more ham than spam.



This is also one way that gray-listing helps.


Review the thread. You don't want to talk to me about greylisting. ;-)

But for all the discussion today, we never really had a good talk about 
postscreen, which is something I'd like to hear someone expound a bit upon.




I've used site-wide Bayes with auto-learning at a site with ~3000 users
and have had to flush & restart our Bayes database twice in 10 years.



I'll add you to the list of people telling me that jumping out of an 
airplane at 20,000 feet with nothing but a parachute and a pair of 
underwear is fun.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 02:39 AM, Dave Funk wrote:


Steve,
For some reason you seem to be hung-up on Bayes "autolearning".


Skip down the thread. I was demonstrated to be wrong. :-)



Is it possible that you're confusing it with "Auto-White listing"? (which is now
deprecated and has -nothing- to do with Bayes).


No. I know the difference. AWL, planned to be replaced with TxRep and 
all that. (I'd mention that TxRep has problems, but it's too late at 
night for me to engage in yet another argument.)




SA's Bayesian scorer is a system based upon a method that parses a
message, extracts 'tokens' from it and uses an algorithm to calculate a
score for the message based upon a dictionary of previously seen tokens
and their relative merit.


Yeah. Bayesian statistics is pretty cool.


or via an automated process from within SA as it scores messages
(known as 'auto' learning). So regardless of whether manual or auto
learning is utilized, tokens are added to the dictionary.


See, that's where things stop making sense to me. I would not expect the 
Bayesian filter to do any better than its training. And if its 
training is via input from static rules (plus DNSBL's and DCC's) I would 
not expect it to be able to do any better. And it's not hard to imagine 
pathological behavior developing. But people are telling me different. 
And I'm open to considering alternative possibilities.



It's also
possible to employ both auto & manual learning methods in the same
installation.


That would be the scenario I am considering.


There can be one dictionary used for scoring all messages processed (called
"site wide Bayes") or many separate dictionaries, one used for each
recognized user ("per user Bayes"). Either way, the dictionary(s) need to
be updated (and the update process could be either manual, auto, or both).


Yes. I've been devoted to individual fileDB's, each individually trained 
for a particular user's spam^Wemail stream. People are telling me that 
system-wide databases work well.



It's been this way for the past 10+ years AFAIK (well, maybe 10 years
ago it didn't have as many options for back-end database storage, mostly
limited to Berkeley-DB type methods).



I think it was around 2003, in SA 2.5(?) that SA got a Bayesian 
classifier. IIRC, there was a project called dspam (which I think is 
still around). For a while the dspam guys were pushing the fact that 
*dspam* was a modern spam filter, and SA was old, clunky, and too 
outdated to use.


Anyway, in the very early versions of SA Bayes, everything was 
system-wide. Later they added the option to use individual user files. 
And the only info I've seen that described autolearn and how it worked 
was a mailing list post from 2004 which specifically stated that it was 
system-wide, in memory, and was lost upon restart. Maybe that's correct 
and maybe it's not.


But today, it looks to be user-specific, if configured that way. I'm 
still working out whether I want to use it, and if so, how.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Dave Funk

On Wed, 2 Jul 2014, Steve Bergman wrote:

Well... I just turned on autolearn for a moment, deleted the bayes_* files on 
the test account I use, and sent myself a message from my usual outside 
account. And new bayes_* files were created. So I was wrong, and I win. More 
options.


So now I can proceed to the "what does this mean?" phase.

If I leave things as they are, then training is perfect if the users are 
diligent. But if they are not, then... what? I see plenty of spams getting 
through with a 0.0 score. IIRC, the autolearn spam threshold is 7? Pretty 
much everything there is spam.


But I'm not sure I quite buy having the static rules of SA training Bayes. 
Isn't Bayes just learning to emulate the static rules, with all their 
imperfections?


Unless you've explicitly disabled them, the network based rules (razor,
pyzor, dcc, DNS based rules, RBLs, URIBLs, etc) constitute an external
'reputation' system to pass judgment on messages.
It's not uncommon to take a low-scoring spam and find that it gets a
higher score on retest as it has been added to various bad-boy lists.

This is also one way that gray-listing helps. If you stiff-arm the first
pass of a spam run a later check may hit it more accurately as it's been
added to block-lists in the mean-time.


If it starts going wrong, doesn't that mean the errors are going to spiral 
out of control?


That is a possible risk of relying solely on auto-learning.
The autolearn system has been carefully crafted and tuned over the years
to try to prevent a feed-back loop from throwing it into a tail-spin.
For example the internal scoring system used to determine if a message
is spam or ham WRT the choice for auto-learning explicitly excludes
the Bayes score (and other particular kinds of scores such as white/black
lists) to try to prevent tail-eating.
Occasional judicious manual learning can help to 'tweak' things when Bayes
looks like it's not in top shape. (IE manual learning of FPs & FNs).
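
The exclusion described above can be sketched roughly as follows. This is
only an illustration, not SA's actual code: the `RuleHit` type, the
`learn_exempt` flag and the `EXAMPLE_*` rule names are made up for the
example, and the exact boundary comparisons are an assumption. The 0.1 and
12.0 values are the documented defaults for the autolearn thresholds. The
real plugin applies further conditions (for instance, minimum header and
body points before learning as spam) that are left out here.

```python
from dataclasses import dataclass

# Documented defaults for the autolearn thresholds.
HAM_THRESHOLD = 0.1
SPAM_THRESHOLD = 12.0


@dataclass(frozen=True)
class RuleHit:
    name: str
    score: float
    # Hypothetical flag: True for Bayes and white/black-list style rules,
    # which must not feed back into the learning decision.
    learn_exempt: bool = False


def autolearn_decision(hits, ham_threshold=HAM_THRESHOLD,
                       spam_threshold=SPAM_THRESHOLD):
    """Return 'spam', 'ham' or None, ignoring exempt rules."""
    learn_score = sum(h.score for h in hits if not h.learn_exempt)
    if learn_score >= spam_threshold:
        return "spam"
    if learn_score < ham_threshold:
        return "ham"
    return None  # gray area: do not learn either way


# A message that only looks spammy because Bayes says so is not re-learned.
bayes_only = [RuleHit("BAYES_99", 13.0, learn_exempt=True),
              RuleHit("EXAMPLE_BODY_RULE", 2.0)]
strong_spam = [RuleHit("BAYES_99", 3.5, learn_exempt=True),
               RuleHit("EXAMPLE_URIBL", 6.0),
               RuleHit("EXAMPLE_HEADER_RULE", 6.5)]
clean_ham = [RuleHit("BAYES_00", -2.0, learn_exempt=True)]
```

With these inputs, `bayes_only` falls into the gray area and is not learned,
which is the point of keeping the Bayes score out of the decision.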

I've used site-wide Bayes with auto-learning at a site with ~3000 users
and have had to flush & restart our Bayes database twice in 10 years.

Dave

--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 02:14 AM, Axb wrote:


You don't need to trust me or believe me (I'm not selling anything -
just commenting on what works for me)


Well, I know you know what I meant.


Ever thought of running a newer distro in a VM, only for SA and let
spamass-milter use that?
That would mean you can play with SA 3.4 without having to redo all your
mail infra?



I'm pushing to do our Ubuntu 14.04 upgrade soon to get the dovecot full 
text search. And then a memory upgrade. And these days I just max them 
out on memory. 4GB -> 32GB. Plus adding a 4TB RAID1.


So it ought to be able to handle almost anything. And I've just 
confirmed that SA 3.4 made it into 14.04.


That should, at least, avert all those annoying "time to upgrade" 
responses like I got here earlier.


It's very late here. 2:45AM, I see. But it's been a lot of fun arguing 
with you guys today. And thanks for all the help. Pyzor seems to be 
functioning fine now.


General rules of thumb to keep in mind:

Whenever there are inexplicable problems, it's probably selinux causing 
them. And if not that, regular old POSIX permissions.


And if ever there is an article of clothing you need but can't find 
anywhere in the house, there's usually a dog sleeping on it. Or possibly 
a cat.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Dave Funk

On Wed, 2 Jul 2014, Steve Bergman wrote:


On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote:


Those do not tell you about using file or SQL based databases?


They do. But not specifically with respect to autolearn.

You never

thought about googling for "spamassassin per user" and friends? You
never checked the SA wiki?


I have, indeed. No reference to autolearn and persistent storage. The lack of 
mention is notable.


I'd expect people to be lining up to tell me I'm mistaken if I absolutely 
were.


Can you point me to a change log somewhere documenting autolearn moving from 
in-memory and system-wide to per user and persistent?


I don't hold a strong opinion on this. It would be nice if I were wrong. It 
would open more options.


I'm just waiting for evidence that it's the case. My perception is that it's 
not.


-Steve


Steve,
For some reason you seem to be hung-up on Bayes "autolearning". Is it
possible that you're confusing it with "Auto-White listing"? (which is now
deprecated and has -nothing- to do with Bayes).

SA's Bayesian scorer is a system based upon a method that parses a
message, extracts 'tokens' from it and uses an algorithm to calculate a
score for the message based upon a dictionary of previously seen tokens
and their relative merit.

The dictionary is created and updated by a process called 'learning'
wherein already-classified messages are tokenized and their tokens are
stored in the dictionary along with a merit value derived from their
instance count and a factor taken from being classified as spam or ham.
This learning process can be either externally driven (known as 'manual'
learning) or via an automated process from within SA as it scores messages
(known as 'auto' learning). So regardless of whether manual or auto
learning is utilized, tokens are added to the dictionary. It's also
possible to employ both auto & manual learning methods in the same
installation.
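
A toy version of such a dictionary, to make the learn/score cycle concrete.
This is a deliberately simplified naive-Bayes sketch with add-one smoothing,
not SA's tokenizer or its probability combining; the class and method names
are invented for the example. The caller decides whether a message is spam
or ham, which is exactly where manual and auto learning differ.

```python
import math
import re
from collections import Counter


class TokenDictionary:
    """Per-token spam/ham message counts plus the nspam/nham totals."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.nspam = 0
        self.nham = 0

    @staticmethod
    def tokenize(text):
        # Each distinct token counts once per message.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def learn(self, text, is_spam):
        """Add an already-classified message to the dictionary."""
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def spam_probability(self, text):
        """Combine per-token evidence into a probability between 0 and 1."""
        log_odds = math.log((self.nspam + 1) / (self.nham + 1))
        for tok in self.tokenize(text):
            p_spam = (self.spam_tokens[tok] + 1) / (self.nspam + 2)
            p_ham = (self.ham_tokens[tok] + 1) / (self.nham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


d = TokenDictionary()
d.learn("cheap pills buy now", is_spam=True)
d.learn("buy cheap watches", is_spam=True)
d.learn("meeting agenda for monday", is_spam=False)
d.learn("lunch on monday", is_spam=False)
```

After this training, `d.spam_probability("cheap pills")` comes out well above
0.5 and `d.spam_probability("monday meeting")` well below it.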

There can be one dictionary used for scoring all messages processed (called
"site wide Bayes") or many separate dictionaries, one used for each
recognized user ("per user Bayes"). Either way, the dictionary(s) need to
be updated (and the update process could be either manual, auto, or both).
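
The difference between the two layouts comes down to which key is used to
look up the dictionary. A minimal sketch, assuming a hypothetical
`BayesStores` helper (nothing by that name exists in SA): in site-wide mode
every user maps to one shared store, in per-user mode each username gets its
own.

```python
from collections import Counter, defaultdict


class BayesStores:
    """Hand out one shared token store or one per user (hypothetical helper)."""

    SITE_KEY = "__site__"

    def __init__(self, per_user):
        self.per_user = per_user
        self._stores = defaultdict(Counter)

    def store_for(self, username):
        key = username if self.per_user else self.SITE_KEY
        return self._stores[key]


site_wide = BayesStores(per_user=False)
site_wide.store_for("alice")["pills"] += 1   # learned while scanning alice's mail

per_user = BayesStores(per_user=True)
per_user.store_for("alice")["pills"] += 1
```

In the site-wide case bob's scans see the token alice's mail contributed; in
the per-user case bob's store is still empty.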

The Bayes dictionary(s) need to be stored some how, the usual method is
via some kind of database. It could be a simple file based DB, some kind
of fancy SQL server based system or something else. This is a DBA'ish kind
of choice as to what particular technology is used to store the
dictionary DB. (usually on disk in some way, could be in some kind of
memory resident set of tables, or something else???).

So you have a multi-dimensional matrix WRT your Bayes system
configuration, and manual VS auto learning is just one factor.

It's been this way for the past 10+ years AFAIK (well, maybe 10 years
ago it didn't have as many options for back-end database storage, mostly
limited to Berkeley-DB type methods).

I hope this helps you.


--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman



On 07/02/2014 02:02 AM, Axb wrote:


and don't count on that - they may do it the first week, new toy,
but for how long?


Not new. They'd previously been training SA with Evolution for some 
years. I have some confidence in many of them doing it right.




Also: keep in mind each user's Bayes folder also gets a bayes_seen file
which grows and grows and grows and never gets truncated.


Well, I have the maximum bayes toks set at 2,000,000. Is bayes_seen 
likely to become a problem with ~100 users and 4TB of disk space?


My largest email volume user has accumulated only 320k of "seen" in 10 
days. And I assume that repeat spams don't add to it.
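
A back-of-the-envelope projection from those numbers, as a sketch only. It
assumes "320k" means 320 KiB, that growth is linear, and, pessimistically,
that all ~100 users are as busy as the largest one. The function name is
made up for the example.

```python
def projected_seen_growth(bytes_observed, days_observed, users, horizon_days=365):
    """Linear worst-case projection of total bayes_seen growth in bytes."""
    per_user_per_day = bytes_observed / days_observed
    return per_user_per_day * users * horizon_days


KIB = 1024
yearly = projected_seen_growth(320 * KIB, days_observed=10, users=100)
print(f"{yearly / KIB**3:.2f} GiB per year")  # prints "1.11 GiB per year"
```

Under those assumptions the total stays at roughly a gigabyte per year across
all users, which is small next to 4TB of disk.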




Do you really want to spend time watching each user's Bayes?


Not really. But I'll do whatever is necessary.

-Steve



Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 09:01 AM, Steve Bergman wrote:

Axb,

I'm not sure I quite believe it. And I'm not quite sure I trust you. But
you do make an attractive pitch. Excellent spam filtering, system-wide,
with no responsibility for training on the part of the users?


You don't need to trust me or believe me (I'm not selling anything - 
just commenting on what works for me)


You can try it and after a couple of weeks, see if it works for you and 
then if necessary come up with new methods for extra training or dump 
the concept totally.


Bayes is yet another scoring mechanism in SA. If you have enough 
traffic, you can wipe the data any time and it's not like you're 
switching SA off totally.


During the dev/test process of the Redis backend, as stuff changed on a 
daily basis I was forced to purge the Bayes data several times/week.

It even became a running joke (wave Henrik/Marc).


This sounds like the kind of "too good to be true" message that I'd
expect to receive in a spam mail.


:-)



But hmm. This is good dream material for tonight. I wonder if our Ubuntu
14.04 upgrade has SA 3.4 with redis built in. I do hear that the redis
backend is amazing.


Ever thought of running a newer distro in a VM, only for SA and let 
spamass-milter use that?
That would mean you can play with SA 3.4 without having to redo all your 
mail infra?




Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb

On 07/02/2014 08:48 AM, Steve Bergman wrote:

Someone, please convince me that I should turn it on.


autolearn doesn't mean you cannot also train manually...


Should I turn it on and take my "train as ham" entry out of .forward? Or
should I not?


manually training ham from unreviewed data?
bad idea.


I suppose that largely depends upon my individual users' levels of
diligence.


and don't count on that - they may do it the first week, new toy, 
but for how long?


Also: keep in mind each user's Bayes folder also gets a bayes_seen file 
which grows and grows and grows and never gets truncated.


Do you really want to spend time watching each user's Bayes?





Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman

Axb,

I'm not sure I quite believe it. And I'm not quite sure I trust you. But 
you do make an attractive pitch. Excellent spam filtering, system-wide, 
with no responsibility for training on the part of the users?


This sounds like the kind of "too good to be true" message that I'd 
expect to receive in a spam mail.


But hmm. This is good dream material for tonight. I wonder if our Ubuntu 
14.04 upgrade has SA 3.4 with redis built in. I do hear that the redis 
backend is amazing.


-Steve



Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman
Well... I just turned on autolearn for a moment, deleted the bayes_* 
files on the test account I use, and sent myself a message from my usual 
outside account. And new bayes_* files were created. So I was wrong, and 
I win. More options.


So now I can proceed to the "what does this mean?" phase.

If I leave things as they are, then training is perfect if the users are 
diligent. But if they are not, then... what? I see plenty of spams 
getting through with a 0.0 score. IIRC, the autolearn spam threshold is 
7? Pretty much everything there is spam.


But I'm not sure I quite buy having the static rules of SA training 
Bayes. Isn't Bayes just learning to emulate the static rules, with all 
their imperfections?


If it starts going wrong, doesn't that mean the errors are going to 
spiral out of control?


Leaving autolearn off puts everything in the hands of the users. And 
that's where I've left things for now.


Someone, please convince me that I should turn it on.

Should I turn it on and take my "train as ham" entry out of .forward? Or 
should I not?


I suppose that largely depends upon my individual users' levels of 
diligence.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Axb

On 07/02/2014 08:00 AM, Steve Bergman wrote:



On 07/02/2014 12:52 AM, Axb wrote:

Site wide bayes works VERY well even under such ugly conditions as
traffic with multiple languages, for ham as well as spam.


Please tell me more.

This goes against Paul Graham's original advice, IIRC. And it goes
against intuition. Then again. Bayesian statistics go against intuition.

It's hard to let go and trust a system-wide Bayes. But I'm listening...


It works, trust me. SA's Bayes implementation is incredibly robust.


My site wide Bayes DB is not exactly small.

0.000  0   23850755  0  non-token data: nspam
0.000  0   10702302  0  non-token data: nham

Would I run a monster this size if it didn't work? Nope.
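
For anyone who wants to track these counters over time, a small sketch that
pulls them out of text in the format quoted above. It assumes the value is
always the third whitespace-separated column and the label follows
"non-token data:", which is inferred only from the two lines shown; the
function name is made up.

```python
def parse_magic(dump_text):
    """Map each 'non-token data' label to its integer value."""
    counts = {}
    for line in dump_text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:")
        counts[label.strip()] = int(fields.split()[2])  # third column
    return counts


sample = """\
0.000  0   23850755  0  non-token data: nspam
0.000  0   10702302  0  non-token data: nham
"""
counts = parse_magic(sample)
spam_share = counts["nspam"] / (counts["nspam"] + counts["nham"])
```

For the figures above, `spam_share` works out to about 0.69, i.e. roughly
two learned spams for every learned ham.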

I waited a long time to be able to use something really 100% site wide 
(not per server) till we got the ability to use Redis, which is FAST, 
robust and doesn't cause me headaches like SQL, file permission issues, etc.


I can't give you a scientific reason for not using per user Bayes.
Site wide works for my +2000 corp domains which include .tr, .ru, .cn, 
.ua, .es, .fr, .de plus a ton of other major ccTLD domains


AND: I only run autolearn. NO manual/scheduled training.





Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread John Hardin

On Wed, 2 Jul 2014, Steve Bergman wrote:




On 07/01/2014 11:14 PM, John Hardin wrote:


 Autolearn trains the bayes database. The bayes data is stored wherever
 you configured it to be stored, in a DBM database or SQL or redis, and
 it's per-user if you configure per-user Bayes databases and scan emails
 using different usernames (vs. a global user like root or amavis).


That is interesting. How sure are you of this? Because if you're pretty sure, 
it's a piece of information I've been keen to confirm for a while.


The bayes database is the only thing in SA that can be trained. (I'm 
excluding submission of the message to pyzor et. al. because that's 
obviously not local.)


Odd, though, that before I set up .forward to train incoming mails as ham and 
disabled autolearn, no nhams were showing up in "sa-learn --dump magic" for 
the individual users. Just nspams.


That is rather odd. Very-low-scoring hams should be autolearned as ham 
unless the default thresholds have been changed.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  News flash: Lowest Common Denominator down 50 points
---
 3 days until the 238th anniversary of the Declaration of Independence


Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman



On 07/02/2014 12:52 AM, Axb wrote:

Site wide bayes works VERY well even under such ugly conditions as
traffic with multiple languages, for ham as well as spam.


Please tell me more.

This goes against Paul Graham's original advice, IIRC. And it goes 
against intuition. Then again. Bayesian statistics go against intuition.


It's hard to let go and trust a system-wide Bayes. But I'm listening...

-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Axb

On 07/02/2014 07:37 AM, Steve Bergman wrote:



Let's turn this around?  Can you prove autolearn was ever done to memory?


I'm not really interested in proving anything. I'm interested in being
convinced that autolearn is individual file-based when spamc is run as
the individual user.


It's in the code... but yes, autolearn is always file based and respects 
the per user settings unless you run  spamd with -x



I'm not quite sure how that would affect my strategy. But it might (or
might not) make autolearn useful.


More importantly, you may need to reconsider whether per user Bayes will 
give you the level of quality you're aiming for, and from experience I 
can tell you: it won't.


Site wide bayes works VERY well even under such ugly conditions as 
traffic with multiple languages, for ham as well as spam.











Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman



Let's turn this around?  Can you prove autolearn was ever done to memory?


I'm not really interested in proving anything. I'm interested in being 
convinced that autolearn is individual file-based when spamc is run as 
the individual user.


I'm not quite sure how that would affect my strategy. But it might (or 
might not) make autolearn useful.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Axb

On 07/02/2014 07:19 AM, Steve Bergman wrote:



On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote:


Those do not tell you about using file or SQL based databases?


They do. But not specifically with respect to autolearn.

You never

thought about googling for "spamassassin per user" and friends? You
never checked the SA wiki?


I have, indeed. No reference to autolearn and persistent storage. The
lack of mention is notable.

I'd expect people to be lining up to tell me I'm mistaken if I
absolutely were.

Can you point me to a change log somewhere documenting autolearn moving
from in-memory and system-wide to per user and persistent?

I don't hold a strong opinion on this. It would be nice if I were wrong.
It would open more options.

I'm just waiting for evidence that it's the case. My perception is that
it's not.


Let's turn this around?  Can you prove autolearn was ever done to memory?

If you mean  "autolearn to journal", this is also file based.

I've been using SA since before it was an Apache project, when it was 
developed by McAfee and the sources were on Sourceforge and back then it 
was already file based.






Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman



On 07/01/2014 11:14 PM, John Hardin wrote:


Autolearn trains the bayes database. The bayes data is stored wherever
you configured it to be stored, in a DBM database or SQL or redis, and
it's per-user if you configure per-user Bayes databases and scan emails
using different usernames (vs. a global user like root or amavis).



That is interesting. How sure are you of this? Because if you're pretty 
sure, it's a piece of information I've been keen to confirm for a while.


Odd, though, that before I set up .forward to train incoming mails as 
ham and disabled autolearn, no nhams were showing up in "sa-learn --dump 
magic" for the individual users. Just nspams.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman



On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote:


Those do not tell you about using file or SQL based databases?


They do. But not specifically with respect to autolearn.

You never

thought about googling for "spamassassin per user" and friends? You
never checked the SA wiki?


I have, indeed. No reference to autolearn and persistent storage. The 
lack of mention is notable.


I'd expect people to be lining up to tell me I'm mistaken if I 
absolutely were.


Can you point me to a change log somewhere documenting autolearn moving 
from in-memory and system-wide to per user and persistent?


I don't hold a strong opinion on this. It would be nice if I were wrong. 
It would open more options.


I'm just waiting for evidence that it's the case. My perception is that 
it's not.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Karsten Bräckelmann
On Tue, 2014-07-01 at 22:40 -0500, Steve Bergman wrote:
> On 07/01/2014 10:21 PM, Karsten Bräckelmann wrote:
> >
> > http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html
> > http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html
> 
> I've read those over and over. It never says anything about where the 
> data is maintained, or whether it's per-user or not. The *only* solid 
> claim I have is a ten year old (yes, at the dawn of SA Bayes) post which 
> specifically says it's in memory, system-wide, and lost upon SA restart.

Those do not tell you about using file or SQL based databases? You never
thought about googling for "spamassassin per user" and friends? You
never checked the SA wiki?

FWIW, the links given do NOT refer to in-memory only at all.

An in-memory-only Bayes database would definitely date back much more than
ten years, if it ever existed. No need for me to even check.

> > Milter usually means system-wide. (But since you just asked, it is.)
> 
> I'm using spamass-milter. It suid's to the recipient user for most 
> mails. For aliases it defaults to a particular user who gets an 
> unbelievable amount of spam at the gate, and whom I know sorts his 
> ham/spam religiously.

So you want to check back with your specific setup and its docs.
Suid'ing is pretty likely to be per-user, though the definition of user
is not specifically clear in the context of a milter (and the final
recipient).

In either case, that is not SA specific. (SA happily uses both, per-user
or site-wide config AND bayes database, depending on context.) Refer to
your milter's docs.


> > Irrespective of your feeling -- cheers!  /me having a beer
> 
> Whew! After the conversations I've had here, today, I need one, too! ;-)

Don't see this as an attack on you. It isn't. Just pointers on helping
your understanding of the situation and your issues. Not always gentle,
but that also reflects the initial stance.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Karsten Bräckelmann
On Tue, 2014-07-01 at 22:18 -0500, Steve Bergman wrote:
> On 07/01/2014 09:53 PM, Karsten Bräckelmann wrote:
> 
> > Frankly, it appears you don't understand what auto-learning is.
> 
> So please specify, explicitly, what it is. I asked some specific 
> questions about it. And I'm very interested in the answers.

If you want my opinion, please re-phrase your questions. I locally
deleted most of this previous (originally unrelated) thread.

> Is auto-learn still system-wide? I'd need it to apply to individual 
> users. Is it in-memory only? Or can I have it update the users' filedb 
> token databases?

SA itself never was system-wide, nor user-specific. It is both, can
be either. It depends on the context of calling SA.


> If it's now per user and uses the user databases, then I am more than 
> ready to reconsider my opinion. But I've not been able to get a clear 
> answer to this. I haven't had an opportunity to test. And I'd want 
> confirmation from someone in the know anyway, before I changed strategies.

It does not depend on SA, but on how you invoke SA. We cannot give you a
clear answer. It depends on your system, your SMTP, glue, system wide
calling of SA, and possibly per-user invocations even after system-wide.

To be clear: SA is a filter. It does nothing itself, other than
classification. Being called, and at which point, is outside the scope
of SA. Rejecting, deleting, delivering or any other kind of action is
outside the scope of SA. That's actions performed by the calling layer,
based on the result of SA evaluation.
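
To illustrate the split of responsibilities: SA hands back a score, and the
glue decides what happens to the message. A minimal sketch of such a calling
layer, with hypothetical names and cut-offs throughout (the 5.0 tag level
matches the figure used elsewhere in this thread; the 10.0 reject level is
purely an assumption for the example).

```python
def glue_action(score, tag_at=5.0, reject_at=10.0):
    """Decide what the calling layer does with a message SA has scored."""
    if score >= reject_at:
        return "reject"            # e.g. a milter refusing at SMTP time
    if score >= tag_at:
        return "deliver-to-junk"
    return "deliver-to-inbox"


actions = [glue_action(s) for s in (0.0, 4.9, 5.0, 12.3)]
```

None of these actions is taken by SA itself; changing them means changing the
milter or delivery configuration, not the filter.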


> >> This method shields the user from the worst of the spam, while giving
> >> them full control of what gets relearned as spam.
> >
> > Wrong. It is not "this" (your) method, that shields the user from the
> > worst of the spam. That's SA. Not your style of auto-training.
> 
> Mine is not autotraining at all. It's giving the user a way of 
> explicitly training the backend spam filter.

Quoting your previous post, you "have a line in the users' default
.forward file to train incoming mail as ham". That is auto-training.

> > (Besides, you *are* doing auto-learning, which you just claimed to be a
> > complete joke.)
> 
> No. The messages are assumed ham until the user classifies it as spam. 
> It is explicit learning. Under user control,

Being "assumed" is not the same as being "treated and automatically
reinforced". The latter is what you do. (And btw, Yes. You are
auto-learning.)


> > At this point I won't get into details. It should suffice to highlight
> > that a default ham auto-learning threshold of 0.1 is part of the safety
> > concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)
> 
> I really don't think you understand what it is I'm doing. Anything below 
> a score of 5.0 goes into their mailbox and learned as ham. If it's ham, 
> that's great. If it's spam, they move it to Junk and it gets learned as 
> spam. auto-learn is as brain dead as the defunct AWL.

I perfectly understood what you are doing.

You didn't understand why that is bad. Failing to explain might be my
bad, though I'll leave re-explaining for tomorrow my timezone. Or you
carefully re-reading my posts.


> > I never checked the TB internal Bayes implementation and auto-learn
> > strategy, but I'd be surprised if they do train on black/white, without
> > any gray area in between.
> 
> Optimally, I would have an "incoming folder" and then the user could 
> manually move the messages from there to spam or ham. But considering 

Which is basically what you came from, using Dovecot antispam plugin
with SA, and dedicated folders "where the user could manually move the
messages" to. Why didn't you just set that up?

(Hint: That's your set-up without auto-learning ham Inbox deliveries.)

> that this was not even remotely necessary with our old email provider, I 
> don't feel that I can put my users to that level of extra trouble that 
> they never even thought about having to deal with before, just because 
> SA is not performing as well as the spam filter they are used to. The 

Do initial manual training. Then get back to us.

> mail needs to go into the inbox directly. And for SA's bayesian to work, 
> it needs to be assumed as ham initially.

No.

It seems your previous "email provider", whatever that might be, had
some sort of spam filtering service. Now you're on your own.

Which you are, unless you decide to ask for free (as in beer) support by
the community providing the software for free (as in speech) to help you
weed out the spam. You did ask, which is just fine, but your assumptions
are ki

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread John Hardin

On Tue, 1 Jul 2014, Steve Bergman wrote:




On 07/01/2014 10:21 PM, Karsten Bräckelmann wrote:


http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html
http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html


I've read those over and over. It never says anything about where the data is 
maintained, or whether it's per-user or not. The *only* solid claim I have is 
a ten year old (yes, at the dawn of SA Bayes) post which specifically says 
it's in memory, system-wide, and lost upon SA restart.


Autolearn trains the bayes database. The bayes data is stored wherever you 
configured it to be stored, in a DBM database or SQL or redis, and it's 
per-user if you configure per-user Bayes databases and scan emails using 
different usernames (vs. a global user like root or amavis).


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  News flash: Lowest Common Denominator down 50 points
---
 3 days until the 238th anniversary of the Declaration of Independence

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman



On 07/01/2014 10:21 PM, Karsten Bräckelmann wrote:


http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html
http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html


I've read those over and over. It never says anything about where the 
data is maintained, or whether it's per-user or not. The *only* solid 
claim I have is a ten year old (yes, at the dawn of SA Bayes) post which 
specifically says it's in memory, system-wide, and lost upon SA restart.



Milter usually means system-wide. (But since you just asked, it is.)


I'm using spamass-milter. It suid's to the recipient user for most 
mails. For aliases it defaults to a particular user who gets an 
unbelievable amount of spam at the gate, and whom I know sorts his 
ham/spam religiously.




Which, referring to my previous post, also means, a single sloppy user
deleting your custom-auto-learned FN ham messages affects all your other
users.


No. I make sure to keep each user solely responsible for their own email 
welfare.



Irrespective of your feeling -- cheers!  /me having a beer


Whew! After the conversations I've had here, today, I need one, too! ;-)


-Steve



Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Karsten Bräckelmann
On Tue, 2014-07-01 at 20:53 -0500, Steve Bergman wrote:
> On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:
> 
> > That's pretty bad practice. Fundamentally, you are implementing a custom
> > auto-learn flavor, overruling the SA configurable auto-learn behavior
> 
> BTW, that reminds me of a question I had been meaning to ask on the 
> list. Autolearn. There's very little written about it, so far as I am 

http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html
http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html

> aware. But from what I have gleaned, from old posts, is that it is 
> system-wide and in-memory.

It depends on how you call SA (SMTP or MDA level). SA itself is a
filter, called by your mail-processing chain. Thus, there is no SA
default context of system-wide or per-user. It depends on how you call
it.


> Now, I have Spamass-milter set to run SA 3.3 
> as the recipient user, using the filedb backend. So in 3.3, is autolearn 
> system wide and in memory, or per user and on disk?

Milter usually means system-wide. (But since you just asked, it is.)

Which, referring to my previous post, also means, a single sloppy user
deleting your custom-auto-learned FN ham messages affects all your other
users. Or a non-sloppy, but on-vacation-mode user.

Moreover, there is no in-memory only, not on-disk mode. Unless you don't
have to ask about it.


> This makes a difference regarding what Karsten and I are discussing. I 
> don't suppose I would object to being wrong. But I have a feeling that 
> I'm right.

Irrespective of your feeling -- cheers!  /me having a beer


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman



On 07/01/2014 09:53 PM, Karsten Bräckelmann wrote:


Frankly, it appears you don't understand what auto-learning is.


So please specify, explicitly, what it is. I asked some specific 
questions about it. And I'm very interested in the answers.


Is auto-learn still system-wide? I'd need it to apply to individual 
users. Is it in-memory only? Or can I have it update the users' filedb 
token databases?


If it's now per user and uses the user databases, then I am more than 
ready to reconsider my opinion. But I've not been able to get a clear 
answer to this. I haven't had an opportunity to test. And I'd want 
confirmation from someone in the know anyway, before I changed strategies.





This method shields the user from the worst of the spam, while giving
them full control of what gets relearned as spam.


Wrong. It is not "this" (your) method, that shields the user from the
worst of the spam. That's SA. Not your style of auto-training.



Mine is not autotraining at all. It's giving the user a way of 
explicitly training the backend spam filter.



And unless you disabled Bayes auto-learning in SA (dunno, might have
been mentioned deep in the thread), the user does not have full control
of what gets relearned as spam.



I have disabled autolearning. I thought I mentioned that to you.



(Besides, you *are* doing auto-learning, which you just claimed to be a
complete joke.)


No. The messages are assumed ham until the user classifies them as spam. 
It is explicit learning. Under user control.




At this point I won't get into details. It should suffice to highlight
that a default ham auto-learning threshold of 0.1 is part of the safety
concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)



I really don't think you understand what it is I'm doing. Anything below 
a score of 5.0 goes into their mailbox and is learned as ham. If it's ham, 
that's great. If it's spam, they move it to Junk and it gets learned as 
spam. Auto-learn is as brain dead as the defunct AWL.




I never checked the TB internal Bayes implementation and auto-learn
strategy, but I'd be surprised if they do train on black/white, without
any gray area in between.


Optimally, I would have an "incoming folder" and then the user could 
manually move the messages from there to spam or ham. But considering 
that this was not even remotely necessary with our old email provider, I 
don't feel that I can put my users to that level of extra trouble that 
they never even thought about having to deal with before, just because 
SA is not performing as well as the spam filter they are used to. The 
mail needs to go into the inbox directly. And for SA's Bayesian to work, 
it needs to be assumed as ham initially.


The only thing I see which might change my view would be explicit 
details about where autolearn stores its data and how it is used on a 
per user basis.


-Steve



Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Karsten Bräckelmann
On Tue, 2014-07-01 at 20:36 -0500, Steve Bergman wrote:
> On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:
> >
> > That's pretty bad practice. Fundamentally, you are implementing a custom
> > auto-learn flavor, overruling the SA configurable auto-learn behavior
> 
> SA's autolearn behavior doesn't make much sense. I have no confidence in it.

The auto-learning feature is NOT meant to be a fully automated training
system. It's an aid for the user to eliminate the need to care about the
extremes, while focusing on the close-calls. There are options to tweak
to your specific needs, and there even is no single "SA autolearn
behavior" as you stated, but different flavors. And an option to turn it
off.

Frankly, it appears you don't understand what auto-learning is.

> This method shields the user from the worst of the spam, while giving 
> them full control of what gets relearned as spam.

Wrong. It is not "this" (your) method, that shields the user from the
worst of the spam. That's SA. Not your style of auto-training.

And unless you disabled Bayes auto-learning in SA (dunno, might have
been mentioned deep in the thread), the user does not have full control
of what gets relearned as spam.


> > and ignoring all safety concepts implemented by SA.
> 
> What safety concepts? autolearn is a complete joke. Even the docs 
> explain that it's only there as a last resort method of kinda sorta 
> training the spam filter.

You are doing (custom) auto-learning as ham of any message with a score
less than required_score of 5.0. *That* is a joke.

(Besides, you *are* doing auto-learning, which you just claimed to be a
complete joke.)

At this point I won't get into details. It should suffice to highlight
that a default ham auto-learning threshold of 0.1 is part of the safety
concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)


> > So if a user in a hurry simply deletes some spam, it will remain ham, as
> > far as Bayes is concerned.
> 
> Same as with Thunderbird, I think.

I never checked the TB internal Bayes implementation and auto-learn
strategy, but I'd be surprised if they do train on black/white, without
any gray area in between.

You stated it. Please back up your claim.


> And it's working very well for them. 
> If they act irresponsibly, they'll get more spam. It takes no longer to 
> highlight the spam and click "Junk" than it does to highlight the spam 
> and click "Delete".

While I am aware I'm not the average user -- there's a "delete" action
key on my keyboard. There's no "junk" equivalent. Yes, I avoid using the
mouse if keyboard interaction is more productive...


> I've pretty much decided at this point that if the users don't do what I 
> tell them to do, repeatedly, then what results is not my responsibility.
> 
> And it's not.

Do you hate your users or your job? (Sorry, snide-remark I couldn't
resist. Feel free to ignore.)

> The alternative is to not mark incoming mail as ham, and allow the SA 
> Bayesian filter to remain inactive forever.

No. I can only guess, but it appears there are some mis-interpretations
in that conclusion.

The SA Bayesian classifier to "remain inactive forever" can only refer
to insufficient initial training. Manual training. Of at least 200 ham
and spam each (by default, you can lower that to 0). You will easily get
that by manual training of existing messages. And even default auto-
learning would eventually cross the ham number. Less than forever.
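
To make the activation rule above concrete, here is a minimal Python sketch.
It is not SpamAssassin's code; the function name is made up for illustration,
and only the 200/200 defaults come from the text above.

```python
# Hypothetical helper, for illustration only. It mirrors the default
# minimums of 200 learned ham and 200 learned spam described above.
def bayes_is_active(nham: int, nspam: int,
                    min_ham: int = 200, min_spam: int = 200) -> bool:
    """Return True once both learned-message counts reach their minimums."""
    return nham >= min_ham and nspam >= min_spam


# Plenty of spam learned, but not enough ham yet: Bayes stays inactive.
print(bayes_is_active(nham=150, nspam=900))
# Both minimums reached: Bayes starts contributing to the score.
print(bayes_is_active(nham=200, nspam=200))
```

The point of the sketch is that the classifier is held back by whichever of
the two counts is lagging, which is usually ham.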

More importantly, SA still marks (classifies) incoming mail as ham. Just
because its overall score is less than 5.0. It just does not *learn* all
of them as ham. Because there's a chance it might not actually be ham,
but a FN.

That area, between (default) auto-learning as ham and classifying as
spam is the gray area, where actual user input is of much value. For
both, learning spam AND ham, for that matter. In particular, because
generally (and as SA principle), a FP is *much* worse than a FN.
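
The zones described above can be sketched roughly as follows. This is an
illustration, not SA's implementation: the function name and labels are
invented, the 12.0 spam threshold is the default mentioned elsewhere on this
list, and the exact handling of scores that sit right on a boundary is an
assumption. Real SA also computes the auto-learn score from the non-Bayes
score set and applies further conditions.

```python
# Illustrative constants; the values are the defaults quoted in this thread.
HAM_LEARN_THRESHOLD = 0.1     # default ham auto-learn threshold
REQUIRED_SCORE = 5.0          # default required_score
SPAM_LEARN_THRESHOLD = 12.0   # default spam auto-learn threshold


def autolearn_zone(score: float) -> str:
    """Name the zone a score falls into (boundary handling is assumed)."""
    if score < HAM_LEARN_THRESHOLD:
        return "auto-learn as ham"
    if score >= SPAM_LEARN_THRESHOLD:
        return "auto-learn as spam"
    if score < REQUIRED_SCORE:
        return "gray area: classified as ham, not learned"
    return "gray area: classified as spam, not learned"


for s in (-1.0, 3.0, 7.5, 15.0):
    print(s, "->", autolearn_zone(s))
```

Everything between 0.1 and 12.0 is left for the user to train, which is where
manual input is most valuable.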


Your approach of force learning those as ham, is biasing your Bayes DB.
At the very least temporarily (unless a fresh spam campaign has been
re-trained by your users on Monday). At worst, until you clear it.

Btw, is that per-user, or are you gambling a site-wide Bayes DB?


> I opted to give the users the choice of being responsible for sorting, 
> and reaping the benefits of that if they do. And yes, I know that some 
> are not going to.
> 
> I'd be interested if you have a better solution in mind.

Do not auto-learn ham every message that scores below required_score.

Introduce train-on-error for your users, with an extended manual
training option. Specific ham and spam folders, where moving or copying
mail into trains the Bayes classifier. Kind of optional for the user,
unless they feel there's too much mis-classification.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman



On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:


That's pretty bad practice. Fundamentally, you are implementing a custom
auto-learn flavor, overruling the SA configurable auto-learn behavior


BTW, that reminds me of a question I had been meaning to ask on the 
list. Autolearn. There's very little written about it, so far as I am 
aware. But from what I have gleaned, from old posts, is that it is 
system-wide and in-memory. Now, I have Spamass-milter set to run SA 3.3 
as the recipient user, using the filedb backend. So in 3.3, is autolearn 
system wide and in memory, or per user and on disk?


This makes a difference regarding what Karsten and I are discussing. I 
don't suppose I would object to being wrong. But I have a feeling that 
I'm right.


-Steve


Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman



On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:


That's pretty bad practice. Fundamentally, you are implementing a custom
auto-learn flavor, overruling the SA configurable auto-learn behavior


SA's autolearn behavior doesn't make much sense. I have no confidence in it.

This method shields the user from the worst of the spam, while giving 
them full control of what gets relearned as spam.



and ignoring all safety concepts implemented by SA.


What safety concepts? autolearn is a complete joke. Even the docs 
explain that it's only there as a last resort method of kinda sorta 
training the spam filter.




So if a user in a hurry simply deletes some spam, it will remain ham, as
far as Bayes is concerned.


Same as with Thunderbird, I think. And it's working very well for them. 
If they act irresponsibly, they'll get more spam. It takes no longer to 
highlight the spam and click "Junk" than it does to highlight the spam 
and click "Delete".


I've pretty much decided at this point that if the users don't do what I 
tell them to do, repeatedly, then what results is not my responsibility.


And it's not.

The alternative is to not mark incoming mail as ham, and allow the SA 
Bayesian filter to remain inactive forever.


I opted to give the users the choice of being responsible for sorting, 
and reaping the benefits of that if they do. And yes, I know that some 
are not going to.


I'd be interested if you have a better solution in mind.

-Steve


Bayes, Manual and Auto Learning Strategies (was: Re: getting tons of SPAM)

2014-07-01 Thread Karsten Bräckelmann
On Tue, 2014-07-01 at 18:43 -0500, Steve Bergman wrote:
> On 07/01/2014 06:09 PM, RW wrote:
> > I'm sceptical about the use of Dovecot-Antispam with Spamassassin.
> > The problem is that it trains on SpamAssassin errors rather than Bayes
> > errors. It may be possible to get sufficient spam this way, but ham
> > is learned very slowly through avoidable FPs.
> 
> We currently (early days for this installation) get plenty of spam for 
> the users to train by moving it to the junk folder. Ham was the problem. 
> Dovecot does nothing about training ham.

Dovecot (and its antispam plugin) does nothing about training ham,
either. It offers target folders and triggers, for easy manual (re-)
classification -- and thus training -- of ham and spam.

> That's why I have a line in the users' default .forward file to train
> incoming mail as ham.

That's pretty bad practice. Fundamentally, you are implementing a custom
auto-learn flavor, overruling the SA configurable auto-learn behavior
and ignoring all safety concepts implemented by SA. There's a reason for
the ham and spam learning thresholds, and the ham threshold to be 0.1 by
default, *not* equaling required_score's default of 5.0.

> Then if they or Thunderbird decide to move the mail to Junk, it gets
> re-trained as spam.

So if a user in a hurry simply deletes some spam, it will remain ham, as
far as Bayes is concerned.


> dovecot-antispam is *not* a complete solution, so far as I can see.
> 
> At this early stage, it *is* painful to watch all that spam coming in 
> over the weekend getting trained as ham. I tell my users to mark it as 
> spam on Monday morning. And if they don't, I just figure it's not my fault.

It is your fault to implement a broken training strategy.

> Once the token databases get larger there won't be so much potential 
> flux back and forth, I guess.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Bayes auto-learning a bad idea?

2011-10-01 Thread Matus UHLAR - fantomas

On 28.09.11 10:07, Lars Jørgensen wrote:
Not sure if this is the correct forum, but google couldn't help me 
(or I am too low on caffeine).


I get a lot of spam that would have been flagged as such, but a bayes 
score of -1.9 pulls it down to hammy status.


I train Bayes manually on the borderline cases, but also have 
auto-learning enabled. Is that really a bad idea? Should I disable 
it, delete the bayes-databases and start over on manual-only 
learning?


Do you run manual learning? Relying on automatic learning alone can 
easily make things go wrong and lead people to think Bayes is bad. 

If you re-train on those that misfired, you should get BAYES hitting 
properly soon.


(Providing you didn't misconfigure on e.g. trusted_networks or 
internal_networks. That could break SA very "effectively").

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Posli tento mail 100 svojim znamim - nech vidia aky si idiot
Send this email to 100 your friends - let them see what an idiot you are


Re: Bayes auto-learning a bad idea?

2011-09-28 Thread RW
On Wed, 28 Sep 2011 14:30:32 +0200
Lars Jørgensen wrote:

> Looking at 
> http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html#learning_options
>  
> i see an option called "bayes_use_hapaxes" that promises
> significantly better hit-rates, but also increases database size by a
> factor of 8 to 10. 

I've never understood what this is supposed to mean, and I suspect
it's just plain wrong. bayes_use_hapaxes determines whether hapaxes
(tokens with a total count of 1) are used in the calculation. It
doesn't affect whether they are stored; and it can't since all tokens
start-off as hapaxes. It might have a marginal effect through the
updating of atimes, but in that case it's expediting the removal of the
most useful hapaxes.
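
A small Python sketch of that distinction, under the assumption that a
hapax is simply a token whose spam and ham counts add up to 1. The function
names are invented for illustration and this is not how SA stores tokens; it
only shows that excluding hapaxes from the calculation leaves the stored
counts alone.

```python
from collections import Counter


def token_counts(spam_msgs, ham_msgs):
    """Count, per token, how many learned messages of each class contain it."""
    spam, ham = Counter(), Counter()
    for msg in spam_msgs:
        spam.update(set(msg.split()))
    for msg in ham_msgs:
        ham.update(set(msg.split()))
    return spam, ham


def usable_tokens(spam, ham, use_hapaxes=True):
    """Tokens eligible for the calculation; the stored counts are untouched."""
    tokens = set(spam) | set(ham)
    if use_hapaxes:
        return tokens
    # Skip tokens seen exactly once in total (hapaxes).
    return {t for t in tokens if spam[t] + ham[t] > 1}


spam, ham = token_counts(["cheap pills now"], ["meeting notes now"])
print(sorted(usable_tokens(spam, ham, use_hapaxes=True)))
print(sorted(usable_tokens(spam, ham, use_hapaxes=False)))
print(len(spam) + len(ham))  # stored entries are the same either way
```

Every token starts out as a hapax when it is first learned, so the store has
to hold them regardless of the setting.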

> What is the recommendation on this? 

I'd leave it on.





Re: Bayes auto-learning a bad idea?

2011-09-28 Thread Benny Pedersen

On Wed, 28 Sep 2011 14:30:32 +0200, Lars Jørgensen wrote:

On 28-09-2011 13:20, Benny Pedersen wrote:

I train Bayes manually on the borderline cases, but also have
auto-learning enabled. Is that really a bad idea? Should I disable it,
delete the bayes-databases and start over on manual-only learning?


no training is always good


Are you missing a comma? Do you mean "no, training is always good" or
"no training is always good"?


No, it's just that my Boolean algebra and English are bad :)


what score are you learning on ?, default is -0.1 and 12.0, i have
changed them here to -4 and 14


Can't find any settings to that effect, so I guess I am using
defaults. I have entered your settings in my config now.


perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold



Looking at

http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html#learning_options
i see an option called "bayes_use_hapaxes" that promises
significantly better hit-rates, but also increases database size by a
factor of 8 to 10. What is the recommendation on this?


I don't know for sure what is best there; I'm using the default here.

perldoc Mail::SpamAssassin::Plugin::Bayes
perldoc Mail::SpamAssassin::Conf

For 3.3.1 and above I add this to local.cf:

bayes_auto_learn_on_error 1

This reduces Bayes poisoning and load.


If throughput
is a factor in this decision, we are scanning about 60,000 to 90,000
mails a day.


More than my server handles now.




what plugins have you enabled ?


DCC
pyzor/razor
SpamCop
AutoLearnThreshold
TextCat
MIMEHeader
ReplaceTags
DKIM
Check
HTTPSMismatch
URIDetail
Bayes
All the EvalTest plugins
VBounce
ImageInfo
FreeMail


3rd-party rules or just the default SA 3.3.2?


Default and Sought Rules.


That should be safe enough not to cause Bayes any problems.

Tip: if you would like to restart Bayes learning, you can do it like this.
Run:

sa-learn --dump magic

then look at these two settings:

bayes_min_ham_num (Default: 200)
bayes_min_spam_num (Default: 200)

and set each to 200 more than the count listed in the dump magic output. 
This ensures that Bayes goes back into learning mode.





Re: Bayes auto-learning a bad idea?

2011-09-28 Thread Lars Jørgensen

On 28-09-2011 13:20, Benny Pedersen wrote:

I train Bayes manually on the borderline cases, but also have
auto-learning enabled. Is that really a bad idea? Should I disable it,
delete the bayes-databases and start over on manual-only learning?


no training is always good


Are you missing a comma? Do you mean "no, training is always good" or 
"no training is always good"?



what score are you learning on ?, default is -0.1 and 12.0, i have
changed them here to -4 and 14


Can't find any settings to that effect, so I guess I am using defaults. 
I have entered your settings in my config now.


Looking at 
http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html#learning_options 
i see an option called "bayes_use_hapaxes" that promises significantly 
better hit-rates, but also increases database size by a factor of 8 to 
10. What is the recommendation on this? If throughput is a factor in 
this decision, we are scanning about 60,000 to 90,000 mails a day.



what plugins have you enabled ?


DCC
pyzor/razor
SpamCop
AutoLearnThreshold
TextCat
MIMEHeader
ReplaceTags
DKIM
Check
HTTPSMismatch
URIDetail
Bayes
All the EvalTest plugins
VBounce
ImageInfo
FreeMail


3rd-party rules or just the default SA 3.3.2?


Default and Sought Rules.


--
Lars



Re: Bayes auto-learning a bad idea?

2011-09-28 Thread Benny Pedersen

On Wed, 28 Sep 2011 10:07:55 +0200, Lars Jørgensen wrote:

Hi,

Not sure if this is the correct forum, but google couldn't help me
(or I am too low on caffeine).

I get a lot of spam that would have been flagged as such, but a bayes
score of -1.9 pulls it down to hammy status.

I train Bayes manually on the borderline cases, but also have
auto-learning enabled. Is that really a bad idea? Should I disable it,
delete the bayes-databases and start over on manual-only learning?


no training is always good, it's more that Bayes being unsure is the 
problem. When it auto-learns, it does so on the whole content and headers, 
so the more headers and content there are to scan, the better Bayes can 
track what you want as ham/spam.


what score are you learning on ?, default is -0.1 and 12.0, i have 
changed them here to -4 and 14


what plugins have you enabled ?

3rd-party rules or just the default SA 3.3.2?



Bayes auto-learning a bad idea?

2011-09-28 Thread Lars Jørgensen

Hi,

Not sure if this is the correct forum, but google couldn't help me (or I 
am too low on caffeine).


I get a lot of spam that would have been flagged as such, but a bayes 
score of -1.9 pulls it down to hammy status.


I train Bayes manually on the borderline cases, but also have 
auto-learning enabled. Is that really a bad idea? Should I disable it, 
delete the bayes-databases and start over on manual-only learning?



--
Lars


Re: prevent rule from being considered for Bayes auto-learning

2010-10-21 Thread Jason Bertoch

On 2010/10/21 12:17 PM, Michael Scheidell wrote:

we decided that we didn't too much care to auto learn as 'not spam',
emails sent from marketing companies, (because the reverse is true for
auto learn ham) thus:

aa_scores.cf:tflags RCVD_IN_DNSWL_HI net nice noautolearn
aa_scores.cf:tflags RCVD_IN_DNSWL_MED net nice noautolearn
aa_scores.cf:tflags RCVD_IN_DNSWL_LOW  net nice noautolearn
aa_scores.cf:tflags RCVD_IN_RP_SAFE net nice noautolearn
aa_scores.cf:tflags RCVD_IN_RP_CERTIFIED net nice noautolearn


I actually filed a bug on this...

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6344

--
/Jason





Re: prevent rule from being considered for Bayes auto-learning

2010-10-21 Thread Lawrence @ Rogers

On 21/10/2010 2:17 PM, Karsten Bräckelmann wrote:

On Thu, 2010-10-21 at 18:39 +0200, Karsten Bräckelmann wrote:

See M::SA::Plugin::AutoLearnThreshold. In a nutshell,  (a) there are a
few tflags that will prevent a rule's score to be used for auto-learning
and  (b) the score used is picked from the respective non-bayes
score-set.

With (a) you can make a rule invisible to the auto-learning decision.
And by setting the scores for score-set 0 and 1 both to 0 as per (b),
you can effectively disable a rule unless Bayes is enabled.

... *and* have that rule "ignored" for the auto-learning decision, if
Bayes and auto-learn is enabled. (Actually not ignored, but adding zero
doesn't influence the result. ;)

The tflags way is much more straight forward, though.



You cannot, however, create a rule to conditionally prevent auto-
learning altogether (which, as I understand isn't what you had in mind
anyway).

Thanks everyone, I have set the rule to noautolearn using the tflags 
directive (this is what I wanted, for the rule to simply not be 
considered when auto-learning).


- Lawrence


Re: prevent rule from being considered for Bayes auto-learning

2010-10-21 Thread Karsten Bräckelmann
On Thu, 2010-10-21 at 18:39 +0200, Karsten Bräckelmann wrote:
> See M::SA::Plugin::AutoLearnThreshold. In a nutshell,  (a) there are a
> few tflags that will prevent a rule's score to be used for auto-learning
> and  (b) the score used is picked from the respective non-bayes
> score-set.
> 
> With (a) you can make a rule invisible to the auto-learning decision.
> And by setting the scores for score-set 0 and 1 both to 0 as per (b),
> you can effectively disable a rule unless Bayes is enabled.

... *and* have that rule "ignored" for the auto-learning decision, if
Bayes and auto-learn is enabled. (Actually not ignored, but adding zero
doesn't influence the result. ;)

The tflags way is much more straight forward, though.


> You cannot, however, create a rule to conditionally prevent auto-
> learning altogether (which, as I understand isn't what you had in mind
> anyway).

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: prevent rule from being considered for Bayes auto-learning

2010-10-21 Thread Karsten Bräckelmann
On Thu, 2010-10-21 at 13:27 -0230, Lawrence @ Rogers wrote:
> I recall reading somewhere that there is a way to prevent a rule from 
> being considered for Bayes auto-learning. I am trying to create a rule 
   ^ ^
> that hits upon some obvious spam that I am seeing, yet I want to make 
> sure (for now) that any scores it assigns are not used for anything 
> Bayes-related. I cannot seem to find any documentation on how to do this 
> (Google doesn't help). I think it is something to do with setting a 
> tflag, but any guidance would be appreciated.
  ^

Yup, that's correct. Though your google-fu today... The three marked
strings from your own description leads to perfect documentation. :)

See M::SA::Plugin::AutoLearnThreshold. In a nutshell,  (a) there are a
few tflags that will prevent a rule's score to be used for auto-learning
and  (b) the score used is picked from the respective non-bayes
score-set.

With (a) you can make a rule invisible to the auto-learning decision.
And by setting the scores for score-set 0 and 1 both to 0 as per (b),
you can effectively disable a rule unless Bayes is enabled.

You cannot, however, create a rule to conditionally prevent auto-
learning altogether (which, as I understand isn't what you had in mind
anyway).


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: prevent rule from being considered for Bayes auto-learning

2010-10-21 Thread Michael Scheidell

On 10/21/10 11:57 AM, Lawrence @ Rogers wrote:

Hi,

I recall reading somewhere that there is a way to prevent a rule from 
being considered for Bayes auto-learning. I am trying to create a rule 
that hits upon some obvious spam that I am seeing, yet I want to make 
sure (for now) that any scores it assigns are not used for anything 
Bayes-related. I cannot seem to find any documentation on how to do 
this (Google doesn't help). I think it is something to do with setting 
a tflag, but any guidance would be appreciated.


You can prevent your rule from being considered in the DECISION as to 
whether the tokens in the email are auto-learned, with tflags noautolearn.


I don't know of any flag that would prevent the rule itself, so, as an example:

rule1 hits for 15 points.
rule2 (bayes) hits for 4 points, so the total is 19 points. If rule2 has the 
noautolearn flag, then the 'do we auto-learn this' score is only 15 points.
If your spam auto-learn threshold is, say, 15.1, then the whole email is not 
considered for auto-learning.


If rule3 also hits for 2 points, your total score is 21 points, but the 
'decision' score is now 17 points, which is above that threshold, and the 
whole email is auto-learned as spam.
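
The arithmetic in that example can be checked with a short Python sketch.
It is only an illustration: the function name is invented, the rule names,
scores and the 15.1 threshold are the ones from the example, and real SA
applies further conditions before auto-learning (for instance, minimum header
and body points for spam).

```python
# Illustrative only; not SpamAssassin's code.
def autolearn_decision_score(hits, noautolearn):
    """Sum the scores of hit rules, skipping those flagged noautolearn."""
    return sum(score for rule, score in hits.items()
               if rule not in noautolearn)


hits = {"rule1": 15.0, "rule2": 4.0}
flagged = {"rule2"}      # rule2 carries tflags noautolearn
threshold = 15.1         # example spam auto-learn threshold

total = sum(hits.values())
decision = autolearn_decision_score(hits, flagged)
print(total, decision, decision >= threshold)

# A third rule hits for 2 points.
hits["rule3"] = 2.0
total = sum(hits.values())
decision = autolearn_decision_score(hits, flagged)
print(total, decision, decision >= threshold)
```

The message score and the auto-learn decision score are two separate sums;
noautolearn only removes a rule from the second one.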


we decided that we didn't too much care to auto learn as 'not spam', 
emails sent from marketing companies, (because the reverse is true for 
auto learn ham) thus:


aa_scores.cf:tflags RCVD_IN_DNSWL_HI net nice noautolearn
aa_scores.cf:tflags RCVD_IN_DNSWL_MED net nice noautolearn
aa_scores.cf:tflags RCVD_IN_DNSWL_LOW  net nice noautolearn
aa_scores.cf:tflags RCVD_IN_RP_SAFE net nice noautolearn
aa_scores.cf:tflags RCVD_IN_RP_CERTIFIED net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UT_CPR_MAT net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UT_CPR_30   net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UT_CPEAR net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UNVERIFIED_2 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_UNVERIFIED_1 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_SPF net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_SENDERID net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_RDNS net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_OPTIN_LT50 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_OPTIN_GT50 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_OPTIN net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_ML_DOPTIN net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_MI_CPR_MAT net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_MI_CPR_30 net nice noautolearn
aa_scores.cf:tflags RCVD_IN_IADB_MI_CPEAR net nice noautolearn


Regards,

Lawrence Williams
LCWSoft
www.lcwsoft.com



--
Michael Scheidell, CTO
o: 561-999-5000
d: 561-948-2259
ISN: 1259*1300
SECNAP Network Security Corporation

   * Certified SNORT Integrator
   * 2008-9 Hot Company Award Winner, World Executive Alliance
   * Five-Star Partner Program 2009, VARBusiness
   * Best in Email Security,2010: Network Products Guide
   * King of Spam Filters, SC Magazine 2008



RE: prevent rule from being considered for Bayes auto-learning

2010-10-21 Thread Kevin Miller
Lawrence @ Rogers wrote:
> Hi,
> 
> I recall reading somewhere that there is a way to prevent a rule from
> being considered for Bayes auto-learning. I am trying to create a
> rule that hits upon some obvious spam that I am seeing, yet I want to
> make sure (for now) that any scores it assigns are not used for
> anything Bayes-related. I cannot seem to find any documentation on
> how to do this (Google doesn't help). I think it is something to do
> with setting a tflag, but any guidance would be appreciated.  

I think you're looking for this:
  tflags  YOUR_RULENAME   noautolearn

HTH...

...Kevin
-- 
Kevin Miller                Registered Linux User No: 307357
CBJ MIS Dept.               Network Systems Admin., Mail Admin.
155 South Seward Street     ph: (907) 586-0242
Juneau, Alaska 99801        fax: (907) 586-4500
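
To audit which rules already carry the flag Kevin mentions, a small script can scan rule files for `tflags` lines. This is only a sketch: the helper name `noautolearn_rules` and the sample fragment are hypothetical, and it does simple line parsing rather than using SpamAssassin's own config parser.

```python
def noautolearn_rules(cf_text):
    """Return the names of rules whose tflags line includes 'noautolearn'."""
    rules = set()
    for raw in cf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        parts = line.split()
        # Expected shape: tflags RULE_NAME flag [flag ...]
        if len(parts) >= 3 and parts[0] == "tflags" and "noautolearn" in parts[2:]:
            rules.add(parts[1])
    return rules


# Hypothetical config fragment, for illustration only.
SAMPLE = """\
tflags  YOUR_RULENAME      noautolearn
tflags  RCVD_IN_DNSWL_MED  net nice noautolearn
tflags  SOME_OTHER_RULE    net
"""

print(sorted(noautolearn_rules(SAMPLE)))
```

Rules returned by this helper are the ones whose scores should be left out when SpamAssassin decides whether to auto-learn a message.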

prevent rule from being considered for Bayes auto-learning

2010-10-21 Thread Lawrence @ Rogers

Hi,

I recall reading somewhere that there is a way to prevent a rule from 
being considered for Bayes auto-learning. I am trying to create a rule 
that hits upon some obvious spam that I am seeing, yet I want to make 
sure (for now) that any scores it assigns are not used for anything 
Bayes-related. I cannot seem to find any documentation on how to do this 
(Google doesn't help). I think it is something to do with setting a 
tflag, but any guidance would be appreciated.


Regards,

Lawrence Williams
LCWSoft
www.lcwsoft.com


Re: Mailbox for auto learning

2009-08-12 Thread Luis Daniel Lucio Quiroz
On Tuesday 11 August 2009 05:12:05, Cedric Knight wrote:
> Luis Daniel Lucio Quiroz wrote:
> > On Monday 10 August 2009 19:15:15, Cedric Knight wrote:
> >> Stefan wrote:
>
> [...]
>
> >>> You have to forward the message as an attachment and unpack it after
> >>> receiving. Have a look at:
> >>> https://po2.uni-stuttgart.de/~rusjako/sal-wrapper
> >>
> >> Yes, I find this approach works well.  It's the simplest way for me to
> >> train Bayes, and most users can cope with it, providing they're not
> >> using Outlook 2003/XP which can't forward as an attachment.  But
> >> Thunderbird, Outlook Express, Squirrelmail and Pine all can easily.
> >> It's not as simple as a 'This Is Spam' button perhaps, and that's a
> >> *good* thing.  Requiring a little bit of thought stops people using it
> >> as an alternative to the delete key for 'OK, perhaps I did subscribe to
> >> this but I don't want it now'.
>
> [...]
>
> > Yes, but the problem is that 99% of users are using some kind of Outlook
>
> Well then, tell them not to :)  Outlook Express and Windows Mail are
> fine.  Outlook 2003 supposedly needs a special program like
> http://www.olspamcop.org/ to forward properly, although if you select
> multiple messages to forward, then it will forward them in some kind of
> possibly useful digest format.  Outlook 2007 introduces an explicit menu
> item called "forward as an attachment" (Ctrl+Alt+F) but still mangles
> the headers:
> http://forum.spamcop.net/forums/index.php?showtopic=10241&st=0&p=70453&#entry70453
>
> Outlook 2007 also mangles the headers (kind of reconstructing a
> misleading semblance of what the original was) when moving between IMAP
> folders.  Therefore, I wouldn't use spamassassin -r on spam from Outlook
> users, but sa-learn to get tokens from the body text may be OK.
>
> Actually, some users of Outlook 2003 do seem to be able to forward as
> intact message/rfc822 attachment.  Not exactly sure how.
>
> Anyway, the 1% using a better e-mail program may be all that's needed to
> train Bayes.
>
> CK

Thanks.

I resolved it with an altermime + Postfix solution: I read my X-quarantine
header to get the mail_id and then add that file.

Rustic, but it works.

LD


Re: Mailbox for auto learning

2009-08-11 Thread Cedric Knight
Luis Daniel Lucio Quiroz wrote:
> On Monday 10 August 2009 19:15:15, Cedric Knight wrote:
>> Stefan wrote:
[...]
>>> You have to forward the message as an attachment and unpack it after
>>> receiving. Have a look at:
>>> https://po2.uni-stuttgart.de/~rusjako/sal-wrapper
>> Yes, I find this approach works well.  It's the simplest way for me to
>> train Bayes, and most users can cope with it, providing they're not
>> using Outlook 2003/XP which can't forward as an attachment.  But
>> Thunderbird, Outlook Express, Squirrelmail and Pine all can easily.
>> It's not as simple as a 'This Is Spam' button perhaps, and that's a
>> *good* thing.  Requiring a little bit of thought stops people using it
>> as an alternative to the delete key for 'OK, perhaps I did subscribe to
>> this but I don't want it now'.
[...]

> Yes, but the problem is that 99% of users are using some kind of Outlook

Well then, tell them not to :)  Outlook Express and Windows Mail are
fine.  Outlook 2003 supposedly needs a special program like
http://www.olspamcop.org/ to forward properly, although if you select
multiple messages to forward, then it will forward them in some kind of
possibly useful digest format.  Outlook 2007 introduces an explicit menu
item called "forward as an attachment" (Ctrl+Alt+F) but still mangles
the headers:
http://forum.spamcop.net/forums/index.php?showtopic=10241&st=0&p=70453&#entry70453

Outlook 2007 also mangles the headers (kind of reconstructing a
misleading semblance of what the original was) when moving between IMAP
folders.  Therefore, I wouldn't use spamassassin -r on spam from Outlook
users, but sa-learn to get tokens from the body text may be OK.

Actually, some users of Outlook 2003 do seem to be able to forward as
intact message/rfc822 attachment.  Not exactly sure how.

Anyway, the 1% using a better e-mail program may be all that's needed to
train Bayes.

CK



Re: Mailbox for auto learning

2009-08-10 Thread Luis Daniel Lucio Quiroz
On Monday 10 August 2009 19:15:15, Cedric Knight wrote:
> Stefan wrote:
> > On Sunday, 9 August 2009 07:36:54, Luis Daniel Lucio Quiroz wrote:
> >> Hi SAs,
> >>
> >> Well, after reading this link
> >> http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still
> >> looking for an easy way to let my mortal users train our antispam.  I
> >> was thinking of mailboxes such as h...@antispamserver and
> >> s...@antispamserver to which users forward their false positives or
> >> false negatives.  Inside each box (ham or spam), of course, a
> >> procmail recipe will feed the mail to sa-learn.
> >>
> >> My questions are the following:
> >> 1. Will forwarded mails be useful for training? I mean, if the spam was
> >> From: spa...@example.net To: u...@mydomain, when forwarded it will
> >> be From: mu...@mydomain To: s...@antispamserver.  Won't this change,
> >> and the forwarding itself (mail clients strip headers), affect
> >> learning?
> >
> > You have to forward the message as an attachment and unpack it after
> > receiving. Have a look at:
> > https://po2.uni-stuttgart.de/~rusjako/sal-wrapper
>
> Yes, I find this approach works well.  It's the simplest way for me to
> train Bayes, and most users can cope with it, providing they're not
> using Outlook 2003/XP which can't forward as an attachment.  But
> Thunderbird, Outlook Express, Squirrelmail and Pine all can easily.
> It's not as simple as a 'This Is Spam' button perhaps, and that's a
> *good* thing.  Requiring a little bit of thought stops people using it
> as an alternative to the delete key for 'OK, perhaps I did subscribe to
> this but I don't want it now'.
>
> My script is very similar to sal-wrapper, using Postfix
> check_recipient_access to ensure only authenticated users can send to
> the reporting address; triggered from procmail; using MIME::Parser to
> extract (possibly multiple) message/rfc822 attachments; feed through
> sa-learn --ham or spamassassin -r as appropriate and send an
> acknowledgement back to the user, to remind them to also send
> spam/non-spam to the corresponding address and correct any mistakes.
>
> One thing I notice from sal-wrapper however is that it pipes the header
> and body to sa-learn without passing a file as parameter.  I found that
> although sa-learn didn't complain, this didn't work at all well, and
> quite short ham messages were scoring BAYES_99.  You can pipe to
> spamassassin -r just like you can to spamassassin in any other mode, but
> I think if you pipe to sa-learn, you need to do it as
>    sa-learn --ham -
>
> with the '-' as parameter, so it reads the standard input.
> Alternatively feed it a temporary message file.  Or am I misreading
> something?
>
> CK
Yes, but the problem is that 99% of users are using some kind of Outlook


Re: Mailbox for auto learning

2009-08-10 Thread Cedric Knight
Stefan wrote:
> On Sunday, 9 August 2009 07:36:54, Luis Daniel Lucio Quiroz wrote:
>> Hi SAs,
>>
>> Well, after reading this link
>> http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still
>> looking for an easy way to let my mortal users train our antispam.  I
>> was thinking of mailboxes such as h...@antispamserver and s...@antispamserver
>> to which users forward their false positives or false negatives.
>> Inside each box (ham or spam), of course, a procmail recipe will feed the
>> mail to sa-learn.
>>
>> My questions are the following:
>> 1. Will forwarded mails be useful for training? I mean, if the spam was From:
>> spa...@example.net To: u...@mydomain, when forwarded it will be From:
>> mu...@mydomain To: s...@antispamserver.  Won't this change, and the
>> forwarding itself (mail clients strip headers), affect learning?
> 
> You have to forward the message as an attachment and unpack it after 
> receiving. 
> Have a look at: 
> https://po2.uni-stuttgart.de/~rusjako/sal-wrapper

Yes, I find this approach works well.  It's the simplest way for me to
train Bayes, and most users can cope with it, providing they're not
using Outlook 2003/XP which can't forward as an attachment.  But
Thunderbird, Outlook Express, Squirrelmail and Pine all can easily.
It's not as simple as a 'This Is Spam' button perhaps, and that's a
*good* thing.  Requiring a little bit of thought stops people using it
as an alternative to the delete key for 'OK, perhaps I did subscribe to
this but I don't want it now'.

My script is very similar to sal-wrapper, using Postfix
check_recipient_access to ensure only authenticated users can send to
the reporting address; triggered from procmail; using MIME::Parser to
extract (possibly multiple) message/rfc822 attachments; feed through
sa-learn --ham or spamassassin -r as appropriate and send an
acknowledgement back to the user, to remind them to also send
spam/non-spam to the corresponding address and correct any mistakes.

One thing I notice from sal-wrapper however is that it pipes the header
and body to sa-learn without passing a file as parameter.  I found that
although sa-learn didn't complain, this didn't work at all well, and
quite short ham messages were scoring BAYES_99.  You can pipe to
spamassassin -r just like you can to spamassassin in any other mode, but
I think if you pipe to sa-learn, you need to do it as
   sa-learn --ham -

with the '-' as parameter, so it reads the standard input.
Alternatively feed it a temporary message file.  Or am I misreading
something?

CK
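
The attachment-extraction step described above can be sketched with Python's standard `email` package instead of Perl's MIME::Parser. This is an illustration under assumptions, not CK's script or sal-wrapper: the function name and the example addresses are hypothetical, and the Postfix/procmail plumbing and the acknowledgement mail are left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_attached_messages(raw_report):
    """Return every message/rfc822 attachment in a report as raw bytes."""
    report = message_from_bytes(raw_report, policy=policy.default)
    found = []

    def visit(part):
        if part.get_content_type() == "message/rfc822":
            # The attached original is the single sub-message of this part.
            found.append(part.get_payload(0).as_bytes())
        elif part.is_multipart():
            for sub in part.iter_parts():
                visit(sub)

    visit(report)
    return found


# Build a hypothetical "forward as attachment" report to exercise the helper.
original = EmailMessage()
original["From"] = "spammer@example.net"
original["To"] = "user@example.org"
original["Subject"] = "Cheap pills"
original.set_content("Buy now\n")

report = EmailMessage()
report["From"] = "user@example.org"
report["To"] = "spam@antispam.example"
report["Subject"] = "Fwd: missed spam"
report.set_content("See attachment.\n")
report.add_attachment(original)

for raw in extract_attached_messages(report.as_bytes()):
    print(message_from_bytes(raw, policy=policy.default)["Subject"])
```

Each returned byte string could then be written to a temporary file and passed to sa-learn, or piped to it with the `-` argument as discussed above.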



Re: Mailbox for auto learning

2009-08-10 Thread Jari Fredriksson
> Stefan wrote:

> This may not be ideal, but in Thunderbird, you can drag
> messages between mailboxes. You could setup each user to
> have access to their own account and the two learning
> mailboxes. You can then have your users drag the false
> positives/negatives to the appropriate box. I have not
> tested this 100%, so I don't know if any headers get
> re-written or not.  

This is possible only when using IMAP. Not POP.

When using IMAP, it is also possible to use folders, no need for separate 
mailboxes. But there will be no difference in using mailboxes or folders, it 
just works.

No header modifications take place on a message when dragging it from folder 
into another, or from mailbox to another.

But as the OP thinks about separate mailboxes, I am afraid that is because he 
has no folders available. That must be because his users are tied to POP3.




Re: Mailbox for auto learning

2009-08-10 Thread Dan Schaefer

Stefan wrote:
> On Sunday, 9 August 2009 07:36:54, Luis Daniel Lucio Quiroz wrote:
>> Hi SAs,
>>
>> Well, after reading this link
>> http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still
>> looking for an easy way to let my mortal users train our antispam.  I
>> was thinking of mailboxes such as h...@antispamserver and s...@antispamserver
>> to which users forward their false positives or false negatives.
>> Inside each box (ham or spam), of course, a procmail recipe will feed the
>> mail to sa-learn.
>>
>> My questions are the following:
>> 1. Will forwarded mails be useful for training? I mean, if the spam was From:
>> spa...@example.net To: u...@mydomain, when forwarded it will be From:
>> mu...@mydomain To: s...@antispamserver.  Won't this change, and the
>> forwarding itself (mail clients strip headers), affect learning?
>
> You have to forward the message as an attachment and unpack it after receiving.
> Have a look at:
> https://po2.uni-stuttgart.de/~rusjako/sal-wrappe
>
>> 2. If the technique in question 1 is useless, what would be a good way to
>> let users report a false positive/negative for training?


This may not be ideal, but in Thunderbird, you can drag messages between 
mailboxes. You could set up each user to have access to their own account 
and the two learning mailboxes. You can then have your users drag the 
false positives/negatives to the appropriate box. I have not tested 
this 100%, so I don't know if any headers get rewritten or not.


--
Dan Schaefer
Web Developer/Systems Analyst
Performance Administration Corp.



Re: Mailbox for auto learning

2009-08-10 Thread Stefan
On Sunday, 9 August 2009 07:36:54, Luis Daniel Lucio Quiroz wrote:
> Hi SAs,
>
> Well, after reading this link
> http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still
> looking for an easy way to let my mortal users train our antispam.  I
> was thinking of mailboxes such as h...@antispamserver and s...@antispamserver
> to which users forward their false positives or false negatives.
> Inside each box (ham or spam), of course, a procmail recipe will feed the
> mail to sa-learn.
>
> My questions are the following:
> 1. Will forwarded mails be useful for training? I mean, if the spam was From:
> spa...@example.net To: u...@mydomain, when forwarded it will be From:
> mu...@mydomain To: s...@antispamserver.  Won't this change, and the
> forwarding itself (mail clients strip headers), affect learning?

You have to forward the message as an attachment and unpack it after receiving. 
Have a look at: 
https://po2.uni-stuttgart.de/~rusjako/sal-wrapper

> 2. If the technique in question 1 is useless, what would be a good way to
> let users report a false positive/negative for training?
>
> TIA
> LD

Greetings
Stefan


Re: Mailbox for auto learning

2009-08-09 Thread Luis Daniel Lucio Quiroz
On Sunday 9 August 2009 10:56:59, Benny Pedersen wrote:
> On Sun, 9 Aug 2009 00:36:54 -0500, Luis Daniel Lucio Quiroz
>
> > 1. Will forwarded mails be useful for training? I mean, if the spam was
> > From: spa...@example.net To: u...@mydomain, when forwarded it will be
> > From: mu...@mydomain To: s...@antispamserver.  Won't this change, and the
> > forwarding itself (mail clients strip headers), affect learning?
> >
> > 2. If the technique in question 1 is useless, what would be a good way to
> > let users report a false positive/negative for training?
>
> dovecot-antispam solves it with dovecot
>
> all users need to do is move mail in imap to junk folder, in that task
> dovecot-antispam calls sa-learn
>
> this means no junk plugins to windows clients
>
> and last but not least no header changes
>
> mail that is moved out of the junk folder is learned as ham, intuitive
> like an amiga :)

Yes, but mine is the worst-case scenario: POP users with MS Outlook.

So I was thinking of using altermime to add something like this as a footer:

"If you think this mail is spam, please click here" (and likewise for ham),
where "here" is a link carrying the message-id (I keep a CC of all mails).

So, another question: will altering the mail by adding a footer affect SA learning?


Re: Mailbox for auto learning

2009-08-09 Thread Benny Pedersen
On Sun, 9 Aug 2009 00:36:54 -0500, Luis Daniel Lucio Quiroz

> 1. Will forwarded mails be useful for training? I mean, if the spam was
> From: spa...@example.net To: u...@mydomain, when forwarded it will be
> From: mu...@mydomain To: s...@antispamserver.  Won't this change, and the
> forwarding itself (mail clients strip headers), affect learning?
> 
> 2. If the technique in question 1 is useless, what would be a good way to
> let users report a false positive/negative for training?

dovecot-antispam solves it with dovecot

all users need to do is move mail in imap to junk folder, in that task
dovecot-antispam calls sa-learn

this means no junk plugins to windows clients

and last but not least no header changes

mail that is moved out of the junk folder is learned as ham, intuitive
like an amiga :)


-- 
Benny Pedersen


Re: Mailbox for auto learning

2009-08-09 Thread RW
On Sun, 9 Aug 2009 00:36:54 -0500
Luis Daniel Lucio Quiroz  wrote:

> Hi SAs,
> 
> Well, after reading this link 
> http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still
> looking for an easy way to let my mortal users train our
> antispam.

If your users use webmail, IMAP, etc., the most convenient approach is to
have folders for learning spam and ham.


Re: Mailbox for auto learning

2009-08-09 Thread Luis Daniel Lucio Quiroz
On Sunday 9 August 2009 06:52:49, you wrote:
> Luis Daniel Lucio Quiroz wrote:
> > Hi SAs,
> >
> > Well, after reading this link
> > http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still
> > looking for an easy way to let my mortal users train our antispam.  I
> > was thinking of mailboxes such as h...@antispamserver and
> > s...@antispamserver to which users forward their false positives or
> > false negatives.  Inside each box (ham or spam), of course, a
> > procmail recipe will feed the mail to sa-learn.
> >
> > My questions are the following:
> > 1. Will forwarded mails be useful for training? I mean, if the spam was
> > From: spa...@example.net To: u...@mydomain, when forwarded it will
> > be From: mu...@mydomain To: s...@antispamserver.  Won't this change,
> > and the forwarding itself (mail clients strip headers), affect
> > learning?
>
> Forwarded mails are NOT useful.
>
> You also neglected to mention the change of Received headers, and pretty
> much every header in the message, the re-encoding of the body by your
> mail client, etc.
>
> Since SA's bayes tokenizes headers, that's disastrous.
>
> > 2. If the technique in question 1 is useless, what would be a good way to
> > let users report a false positive/negative for training?
>
> In some cases you can have the client forward as attachment, and use a
> mailbox that strips attachments and feeds them to sa-learn. As long as
> the client being used forwards the entire original message, with
> complete headers, this should work fine.
>
> > TIA
> >
> > LD

I understand.

And what if I use altermime to add a link that identifies the email in a
quarantine? Will the text added by altermime change learning?


Re: Mailbox for auto learning

2009-08-09 Thread Matt Kettler
Luis Daniel Lucio Quiroz wrote:
> Hi SAs,
>
> Well, after reading this link 
> http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still looking 
> for an easy way to let my mortal users train our antispam.  I was thinking 
> of mailboxes such as h...@antispamserver and s...@antispamserver to which
> users forward their false positives or false negatives.  Inside each box 
> (ham or spam), of course, a procmail recipe will feed the mail to sa-learn.
>
> My questions are the following:
> 1. Will forwarded mails be useful for training? I mean, if the spam was From: 
> spa...@example.net To: u...@mydomain, when forwarded it will be From: 
> mu...@mydomain To: s...@antispamserver.  Won't this change, and the
> forwarding itself (mail clients strip headers), affect learning?
>   
Forwarded mails are NOT useful.

You also neglected to mention the change of Received headers, and pretty
much every header in the message, the re-encoding of the body by your
mail client, etc.

Since SA's bayes tokenizes headers, that's disastrous.
> 2. If the technique in question 1 is useless, what would be a good way to
> let users report a false positive/negative for training?
In some cases you can have the client forward as attachment, and use a
mailbox that strips attachments and feeds them to sa-learn. As long as
the client being used forwards the entire original message, with
complete headers, this should work fine.

>   
>
> TIA
>
> LD
>
>
>   



Mailbox for auto learning

2009-08-08 Thread Luis Daniel Lucio Quiroz
Hi SAs,

Well, after reading this link 
http://spamassassin.apache.org/full/3.2.x/doc/sa-learn.html I'm still looking 
for an easy way to let my mortal users train our antispam.  I was thinking 
of mailboxes such as h...@antispamserver and s...@antispamserver to which
users forward their false positives or false negatives.  Inside each box 
(ham or spam), of course, a procmail recipe will feed the mail to sa-learn.

My questions are the following:
1. Will forwarded mails be useful for training? I mean, if the spam was From: 
spa...@example.net To: u...@mydomain, when forwarded it will be From: 
mu...@mydomain To: s...@antispamserver.  Won't this change, and the
forwarding itself (mail clients strip headers), affect learning?

2. If the technique in question 1 is useless, what would be a good way to
let users report a false positive/negative for training?

TIA

LD


Re: Spam auto-learning by "message resending"

2006-05-11 Thread Stuart Johnston

Jerome Delamarche wrote:

> Hi,
>
> I'm configuring SA and I'm looking for an easy way for the end users to
> improve their own Bayesian filters.
>
> Users do not have interactive accounts on the Linux servers. They cannot use
> "sa-learn" or any other Linux tools.
> It would be nice if they could simply resend to their own mailbox the
> spam that SA did not filter.
>
> SA could (?) determine that it has already analyzed the message and
> automatically treat it as previously missed spam.
> Then it could use the "auto-learn" feature to add it to the user's spam
> database?
>
> Or is there another way to do it?


If your users can use IMAP, you can create a special folder where they
copy spam messages.  The Linux server can sa-learn from these folders.

Or, you can use a system on the Linux server, such as Maia Mailguard,
that temporarily stores all messages on the server and provides a
web-interface for user training.

Another option is to provide a special address that users forward spam 
messages to.  The main problem here is that the message must be 
forwarded as an attachment in a way that a script on the Linux server 
can extract the attachment and get something reasonably close to the 
original spam.  Thunderbird does a pretty good job with this.  Outlook, 
not so much.


-Stuart




Spam auto-learning by "message resending"

2006-05-11 Thread Jerome Delamarche
Hi,

I'm configuring SA and I'm looking for an easy way for the end users to
improve their own Bayesian filters.

Users do not have interactive accounts on the Linux servers. They cannot use
"sa-learn" or any other Linux tools.
It would be nice if they could simply resend to their own mailbox the
spam that SA did not filter.

SA could (?) determine that it has already analyzed the message and
automatically treat it as previously missed spam.
Then it could use the "auto-learn" feature to add it to the user's spam
database?

Or is there another way to do it?

Jerome




RE: Ham not auto-learning?

2005-08-19 Thread Matthew Yette
That sounds about right. I did get those thresholds from somewhere on
this list, though, I believe. No biggie. Bayes has been pretty spot on
so far (I can post the rules chart if anyone is interested.), so I'm
pretty confident in allowing it to continue to learn.

Thanks for your help. 


--
Matthew Yette
Senior Engineer - NOC/Operations
MA Polce Consulting, Inc.
[EMAIL PROTECTED]
315-838-1644 (w)
315-356-0597 (f)
AIM/Yahoo: MAPolceNOC
MSN: [EMAIL PROTECTED]
-----Original Message-----
From: Craig McLean [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 19, 2005 2:31 PM
To: Matthew Yette
Cc: users@spamassassin.apache.org
Subject: Re: Ham not auto-learning?


Matthew Yette wrote:
| Running the sa-stats.pl version 0.9 that produces a chart with stats 
| on what rules are hit for spam and ham most frequently, I notice that 
| of all 13,411 autolearns performed, every one of them was for spam. 
| Ham has 0 messages autolearned. Wouldn't, for example, a message that 
| comes in and has been whitelisted (and therefore scoring ~ -100) be
| autolearned?
| My bayes thresholds are set for 12.1 (spam) and -12.0 (ham).

Matthew,
If I recall correctly, bayes learning thresholds are compared against a
message score *before* whitelist adjustments are made, so unless a
message scores -12 using just the standard rules (unlikely) it will
never be learned as ham. Just set the ham threshold to 0 and you'll see
any message hitting no positive scoring tests being learned as ham.

Regards,
Craig.


Re: Ham not auto-learning?

2005-08-19 Thread Steve Martin

I'm going to guess that whitelist isn't taken into consideration.

-12 for autolearning of ham is pretty extreme, I'm not surprised you  
aren't seeing any autolearning.  The default is 0.1.


On Aug 19, 2005, at 1:24 PM, Matthew Yette wrote:

> Running the sa-stats.pl version 0.9 that produces a chart with stats on
> what rules are hit for spam and ham most frequently, I notice that of
> all 13,411 autolearns performed, every one of them was for spam. Ham has
> 0 messages autolearned. Wouldn't, for example, a message that comes in
> and has been whitelisted (and therefore scoring ~ -100) be autolearned?
> My bayes thresholds are set for 12.1 (spam) and -12.0 (ham).

--
Matthew Yette
Senior Engineer - NOC/Operations
MA Polce Consulting, Inc.
[EMAIL PROTECTED]
315-838-1644 (w)
315-356-0597 (f)
AIM/Yahoo: MAPolceNOC
MSN: [EMAIL PROTECTED]



--
Steve Martin                   http://www.cheezmo.com/
Smart Calibration, LLC         http://www.smartcalibration.com/
The Widescreen Movie Center    http://www.widemovies.com/
Letterboxed Movie TV Schedule  http://www.widemovies.com/lbx.html



Re: Ham not auto-learning?

2005-08-19 Thread Craig McLean


Matthew Yette wrote:
| Running the sa-stats.pl version 0.9 that produces a chart with stats on
| what rules are hit for spam and ham most frequently, I notice that of
| all 13,411 autolearns performed, every one of them was for spam. Ham has
| 0 messages autolearned. Wouldn't, for example, a message that comes in
| and has been whitelisted (and therefore scoring ~ -100) be autolearned?
| My bayes thresholds are set for 12.1 (spam) and -12.0 (ham).

Matthew,
If I recall correctly, bayes learning thresholds are compared against a
message score *before* whitelist adjustments are made, so unless a
message scores -12 using just the standard rules (unlikely) it will
never be learned as ham. Just set the ham threshold to 0 and you'll see
any message hitting no positive scoring tests being learned as ham.

Regards,
Craig.
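
A simplified model of the behaviour Craig describes may make it easier to see why the whitelist hit never leads to ham learning. It is a sketch under assumptions, not SpamAssassin's actual code: the function name is hypothetical, rule hits are reduced to (score, excluded) pairs, boundary handling is a guess, and the additional header/body point requirements SpamAssassin applies before learning spam are left out. The 0.1 and 12.0 defaults are the values quoted on this list.

```python
def autolearn_decision(hits, ham_threshold=0.1, spam_threshold=12.0):
    """Decide 'ham', 'spam' or 'no' from the score used for auto-learning.

    hits is an iterable of (score, excluded) pairs. excluded marks rules that
    do not count towards auto-learning, such as whitelist entries or rules
    flagged noautolearn (assumption: they are simply skipped).
    """
    learn_score = sum(score for score, excluded in hits if not excluded)
    if learn_score >= spam_threshold:
        return "spam"
    if learn_score < ham_threshold:
        return "ham"
    return "no"


# A whitelisted message: -100 from the whitelist, nothing else hit.
whitelisted = [(-100.0, True), (0.0, False)]

print(autolearn_decision(whitelisted, ham_threshold=-12.0))
print(autolearn_decision(whitelisted, ham_threshold=0.1))
```

With the -12.0 threshold the message is not learned, because the score used for learning is 0.0 rather than -100; with a threshold just above zero it is learned as ham.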


Ham not auto-learning?

2005-08-19 Thread Matthew Yette
Running the sa-stats.pl version 0.9 that produces a chart with stats on
what rules are hit for spam and ham most frequently, I notice that of
all 13,411 autolearns performed, every one of them was for spam. Ham has
0 messages autolearned. Wouldn't, for example, a message that comes in
and has been whitelisted (and therefore scoring ~ -100) be autolearned?
My bayes thresholds are set for 12.1 (spam) and -12.0 (ham).

--
Matthew Yette
Senior Engineer - NOC/Operations
MA Polce Consulting, Inc.
[EMAIL PROTECTED]
315-838-1644 (w)
315-356-0597 (f)
AIM/Yahoo: MAPolceNOC
MSN: [EMAIL PROTECTED]


Re: Testing Bayes (auto)-learning

2005-03-19 Thread Matt Kettler
Greg Abbas wrote:

>Paul Boven  chello.nl> writes:
>
>>Yes, they're forwarding the messages as attachments, and yes, I'm 
>>stripping them out of the message/rfc822 attachments before feeding 
>>them to Bayes. And in all the tests I've done so far this seems to work, 
>>but now that we've upgraded to SA3.0.2 I can't peek 'under the hood' 
>>anymore to see if things are still being learned as they should.
>
>On a related note, if I grab messages from a maildir after
>spamassassin has "quarantined" them ("The original message has
>been attached to this so you can view it... yadda yadda") is
>sa-learn smart enough to realize that the spam is contained in
>the attachment? 

sa-learn is smart enough to undo any changes made by spamassassin
itself, so if you use SA to do your tagging, sa-learn will undo it prior
to learning.

However, if you use a tool like amavis, mimedefang, or mailscanner and
use that tool's own encapsulation methods instead of SA's, then sa-learn
won't undo it.



Re: Testing Bayes (auto)-learning

2005-03-19 Thread Greg Abbas
Paul Boven  chello.nl> writes:
> Yes, they're forwarding the messages as attachments, and yes, I'm 
> stripping them out of the message/rfc822 attachments before feeding 
> them to Bayes. And in all the tests I've done so far this seems to work, 
> but now that we've upgraded to SA3.0.2 I can't peek 'under the hood' 
> anymore to see if things are still being learned as they should.

On a related note, if I grab messages from a maildir after
spamassassin has "quarantined" them ("The original message has
been attached to this so you can view it... yadda yadda") is
sa-learn smart enough to realize that the spam is contained in
the attachment? Or is this the same situation as a user-forward,
where I would need to write something to strip it out?

And as an aside, I'm curious about "peeking under the hood" too,
but in my case it's because I'm curious how many messages have
been trained. (In order to find out how soon the filter is going
to think the corpus is large enough to start using its bayes
rules.)

TIA. -g.
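
On the question of how many messages have been trained: `sa-learn --dump magic` prints the database counters, including `nspam` and `nham`. The sketch below parses that output; the sample text is illustrative rather than real output, the helper names are hypothetical, and the minimum of 200 messages per class is the default of `bayes_min_spam_num` / `bayes_min_ham_num` as I understand it, so check your own configuration.

```python
import re

# Matches lines such as:
# 0.000          0        150          0  non-token data: nspam
MAGIC_LINE = re.compile(
    r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham)\s*$"
)


def learned_counts(dump_text):
    """Extract the nspam/nham counters from 'sa-learn --dump magic' output."""
    counts = {}
    for line in dump_text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def bayes_ready(counts, min_spam=200, min_ham=200):
    """True once both classes have reached the assumed minimum corpus size."""
    return counts.get("nspam", 0) >= min_spam and counts.get("nham", 0) >= min_ham


# Illustrative sample only, not captured from a real database.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        150          0  non-token data: nspam
0.000          0        420          0  non-token data: nham
"""

counts = learned_counts(SAMPLE)
print(counts, bayes_ready(counts))
```

In practice the text would come from running the command itself, for example via `subprocess.run(["sa-learn", "--dump", "magic"], capture_output=True, text=True)`.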




Re: Testing Bayes (auto)-learning

2005-03-17 Thread Paul Boven
Hi Daryl, everyone,

Daryl C. W. O'Shea wrote:
> Paul Boven wrote:
>> My problem is that I have end-users that are basically claiming 'the 
>> more I send to the relearn-address, the lower the Bayes score seems to 
>> be getting.' The included headers seem to support that claim, so I 
>> really want to dig a bit deeper into the whole setup.
>
> That there sounds like your problem.  How are your users sending mail to 
> the 'relearn address'?  If they're not forwarding messages as an 
> attachment, and you're not stripping out these attached messages then it 
> isn't going to work to your benefit, and you'll see the result you 
> describe.

Yes, they're forwarding the messages as attachments, and yes, I'm 
stripping them out of the message/rfc822 attachments before feeding 
them to Bayes. And in all the tests I've done so far this seems to work, 
but now that we've upgraded to SA3.0.2 I can't peek 'under the hood' 
anymore to see if things are still being learned as they should.

Regards, Paul Boven.


Re: Testing Bayes (auto)-learning

2005-03-17 Thread Daryl C. W. O'Shea
Paul Boven wrote:
> My problem is that I have end-users that are basically claiming 'the 
> more I send to the relearn-address, the lower the Bayes score seems to 
> be getting.' The included headers seem to support that claim, so I 
> really want to dig a bit deeper into the whole setup.

That there sounds like your problem.  How are your users sending mail to 
the 'relearn address'?  If they're not forwarding messages as an 
attachment, and you're not stripping out these attached messages then it 
isn't going to work to your benefit, and you'll see the result you describe.

Daryl
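
The "strip out the attached message" step can be sketched with Python's standard `email` package. This is only an illustration of the idea, not the script anyone in this thread is actually running; the function name `extract_attached_messages` is an assumption. It returns every `message/rfc822` part of a forwarded mail as raw bytes, which could then be piped to the learner instead of the wrapper message.

```python
from email import message_from_bytes, policy


def extract_attached_messages(raw):
    """Return the raw bytes of every message/rfc822 attachment in `raw`.

    Note: walk() also descends into the attached messages themselves, so a
    forward-inside-a-forward yields both the outer and the inner attachment.
    """
    outer = message_from_bytes(raw, policy=policy.default)
    attached = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the encapsulated message object.
            inner = part.get_payload(0)
            attached.append(inner.as_bytes())
    return attached
```

If the list comes back empty, the user most likely forwarded the mail inline, in which case the original headers are gone and learning from it would train on the forwarder's headers instead.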


Testing Bayes (auto)-learning

2005-03-17 Thread Paul Boven
Hi everyone,
There seem to be some learning-problems with our Bayes database which 
I'm trying to track down.

Given a particular spam-message that got auto-trained as ham, then 
re-trained as spam, I would like to be able to do the following:

1.) Make sure whether it's in the Bayes database or not, and whether it 
is there as ham or as spam. I can use Berkeley's tools to dump the 
bayes_seen database, but often the message-ID isn't in there even though 
the message got learned; probably with a '@sa-generated' message-ID.

Given the original message, how can I determine which Message-ID Bayes 
is using to keep track of the message? When will it accept the original 
Message-ID, and when will it use the generated one? How can I determine 
the sa-generated Message-ID without running it through the learner again?

How sensitive is the generated Message-ID to changes in Received: and 
other headers that happen when the mail gets returned to the learner?

2.) With the new SpamAssassin 3.0.2, I can no longer see what score a 
particular token has, because they are hashed. Is there an easy way to 
generate these hashes or is there an interface that I can use to check 
the score for a token?

My problem is that I have end-users that are basically claiming 'the 
more I send to the relearn-address, the lower the Bayes score seems to 
be getting.' The included headers seem to support that claim, so I 
really want to dig a bit deeper into the whole setup.

Regards, Paul Boven.
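
On question 2: as far as I know, the 3.0 Bayes code stores each token as the last five bytes (40 bits) of its SHA-1 digest, and `sa-learn --dump data` shows that value in hex. Treat this as an assumption to check against the installed `Bayes.pm`, not as documented behaviour. Under that assumption, the hash for a known token can be reproduced with a few lines of Python (the function name `hashed_token` is hypothetical):

```python
import hashlib


def hashed_token(token):
    """Hex form of the last 5 bytes of the token's SHA-1 digest.

    Assumption: this mirrors how SpamAssassin 3.0 hashes Bayes tokens.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return digest[-5:].hex()
```

Even if the hashing matches, the input has to be the token exactly as the tokenizer produced it (case handling, header prefixes and so on), so a bare word from the message body will not necessarily match an entry in the dump.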



Is auto-learning working?

2005-03-08 Thread Michel . PETIT
Hi,

I'm new to spamassassin. I installed it on a Solaris 9 system, and it 
works fine.
But there is one thing I don't understand: I configured auto-learning, 
but when I run spamd it doesn't create the bayes_* files.
If I run sa-learn, then the files are created.
How can I know whether auto-learning is working or not?
What did I forget?

My configuration :

# spamd --version
SpamAssassin Server version 3.0.2
  running on Perl 5.8.3

# cat /etc/mail/spamassassin/local.cf
required_hits 5
rewrite_header Subject  SPAM

report_safe 0
skip_rbl_checks 1

# Enable the Bayes system
use_bayes   1

# Enable Bayes auto-learning
bayes_auto_learn  1
bayes_path  /etc/mail/spamassassin/bayes
bayes_file_mode 0666


Thanks in advance.

Greetings.

-- 
Michel
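
One way to check whether auto-learning is happening is to look at the `autolearn=` field that SpamAssassin writes into the `X-Spam-Status` header of each scanned message. The Python sketch below tallies that field over a set of raw messages; the function names `autolearn_status` and `tally` are illustrative assumptions, and the example header in the usage note is made up.

```python
import re
from collections import Counter
from email import message_from_string


def autolearn_status(raw_message):
    """Return the autolearn value from X-Spam-Status, or None if absent."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    match = re.search(r"\bautolearn=(\w+)", status)
    return match.group(1) if match else None


def tally(raw_messages):
    """Count how often each autolearn value occurs across the messages."""
    return Counter(autolearn_status(raw) for raw in raw_messages)
```

For a header such as `X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham version=3.0.2`, `autolearn_status` returns `"ham"`. A tally dominated by `no` would simply mean that few messages cross the auto-learn thresholds, which is a different situation from learning being attempted and failing.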






RE: Potential new auto-learning strategy

2005-03-02 Thread Gray, Richard



For various reasons (some political, some technical) we don't use Bayes
here. It can be very frustrating, but I'm sure you guys know what it's
like to have your hands tied by corporate wrangling.

The reason I proposed a more complex logic than the one you suggest was
to handle down-scoring rules that performed poorly as well as up-scoring
effective rules. By using a fixed score, you run the risk of either
setting it too low, so that the system takes too long to learn, or too
high (it has been demonstrated that this can cause chaotic behaviour in
some systems). By using a function that calculates X based on the overall
score of the message and the other rules hit, diminished by the learn
rate, the system can quickly cover a large gap, but when the distance
between the two scores becomes small, the changes to the score values are
appropriately small, tending the system towards stability (assuming
spammers don't change tactics).

Should two particular rules commonly occur together, this would also have
the effect of balancing out score changes across them both, relative to
their base values.

I'd like to get into doing this, but work is swamped (I don't get to play
with spam all day :( ). If there are other people keen on doing this then
maybe we can get a collaboration going.
 
R
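
To make the proposal above concrete, here is one possible reading of it as a Python sketch. It is not an existing SpamAssassin feature and not necessarily what was intended: the function name `adjust_scores`, the choice to split the correction in proportion to each rule's current absolute score, and the decision to leave correctly classified messages untouched are all assumptions made for illustration.

```python
def adjust_scores(scores, hits, threshold, is_spam, learn_rate):
    """Return a new score table after learning from one classified message.

    scores     -- dict of rule name -> current score
    hits       -- unique rule names that fired on the message
    threshold  -- the spam/ham boundary (e.g. required_hits)
    is_spam    -- the correct classification supplied by the trainer
    learn_rate -- fraction of the gap to close per message (0..1)
    """
    total = sum(scores[rule] for rule in hits)
    if is_spam:
        # Missed spam: positive gap pushes scores up towards the boundary.
        gap = max(0.0, threshold - total)
    else:
        # False positive: negative gap pulls scores down towards the boundary.
        gap = min(0.0, threshold - total)

    weight = sum(abs(scores[rule]) for rule in hits)
    updated = dict(scores)
    for rule in hits:
        # Each rule takes a share relative to its current (base) value;
        # fall back to an equal split if every hit rule scores zero.
        share = abs(scores[rule]) / weight if weight else 1.0 / len(hits)
        updated[rule] = scores[rule] + learn_rate * gap * share
    return updated
```

Because each step moves the total by `learn_rate * gap`, the change shrinks as the message total approaches the boundary, which is the stabilising behaviour described above.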




RE: Potential new auto-learning strategy

2005-03-02 Thread Chris Santerre



There has been a lot of talk about dynamic scoring. Most people argue
that Bayes is a good substitute for it already. But not if you don't use
Bayes ;)

I think it's a worthy idea for testing, although the logic could be
fairly simple: for example, run the top-hitting-rules script in a cron
job, pull out the top N rules, and add X points to them based on the
hits. That's something I've wanted to play with, but had no time.
 
--Chris 




Potential new auto-learning strategy

2005-03-02 Thread Gray, Richard



I saw an article a while back about some DJs who were using perl as a
mixing tool by writing perl code that edited itself while it ran in a
loop. I thought this was kind of cool.

I studied AI at university, and remember a good bit of discussion
regarding feedback systems.

So, to combine the two, I was thinking of how to use SA in a similar
structure, and propose a dynamic weighting system for SA rules. Consider
the scores that a base installation of SA gives to its rules; when shown
messages to learn from, it would modify the score weighting of the rules
rather than the Bayes system.

I won't get into a discussion of learning rates and so on, but I can
imagine the logic being loosely based on how much influence the rule had
on the total score, the distance of the final result from the spam/ham
boundary, and the learning rate chosen by the administrator.

Any feedback?

R

RE: Auto learning

2005-02-22 Thread Paul J. Smith
Hi,

required_hits 7
report_safe 0
rewrite_header Subject [SPAM]
bayes_auto_learn 1
skip_rbl_checks 0
use_razor2 1
use_dcc 1
use_pyzor 0

dns_available yes

I think I may have just sussed this. I just found a bayes db in
/home/root/.spamassassin, whereas I have been testing things logged in as
root and was looking at /root/.spamassassin. It is being updated!  I
was running things as root, so it was picking up a different database.

So now I have

-rw-------  1 spamd spamd 1.3M Feb 22 15:51 auto-whitelist
-rw-------  1 spamd spamd 3.6K Feb 22 15:51 bayes_journal
-rw-------  1 spamd spamd 652K Feb 22 15:51 bayes_seen
-rw-------  1 spamd spamd 5.3M Feb 22 15:51 bayes_toks

in my /home/spamd/.spamassassin folder

If I run

 sa-learn -D --sync --dbpath /home/spamd/.spamassassin

I still see 

debug: bayes: 25894 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
debug: bayes: 25894 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: bayes: Not available for scanning, only 0 spam(s) in Bayes DB < 200
debug: bayes: 25894 untie-ing
debug: bayes: 25894 untie-ing db_toks
debug: bayes: 25894 untie-ing db_seen
debug: Score set 0 chosen.
debug: Initialising learner
debug: Syncing Bayes and expiring old tokens...
debug: lock: 25894 created /home/spamd/.spamassassin/bayes.lock.localhost.localdomain.25894
debug: lock: 25894 trying to get lock on /home/spamd/.spamassassin/bayes with 0 retries
debug: lock: 25894 link to /home/spamd/.spamassassin/bayes.lock: link ok
debug: bayes: 25894 tie-ing to DB file R/W /home/spamd/.spamassassin/bayes_toks
debug: bayes: 25894 tie-ing to DB file R/W /home/spamd/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: refresh: 25894 refresh /home/spamd/.spamassassin/bayes.lock
debug: refresh: 25894 refresh /home/spamd/.spamassassin/bayes.lock
synced Bayes databases from journal in 3 seconds: 1545 unique entries (1940 total entries)
debug: refresh: 25894 refresh /home/spamd/.spamassassin/bayes.lock
debug: refresh: 25894 refresh /home/spamd/.spamassassin/bayes.lock
debug: Syncing complete.
debug: bayes: 25894 untie-ing
debug: bayes: 25894 untie-ing db_toks
debug: bayes: 25894 untie-ing db_seen
debug: bayes: files locked, now unlocking lock
debug: unlock: 25894 unlink /home/spamd/.spamassassin/bayes.lock

I don't understand why, even though I specified the db path, it still
mentions /root/.spamassassin as well.  Does it try to use both
databases?  It seems to see both databases.

I am seeing some bayes scoring now as well.

If I am using sa-learn, can I just add the --dbpath
/home/spamd/.spamassassin option and it should update the correct db?
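
For anyone comparing runs, a quick way to see which database files a given `sa-learn -D` run actually touched is to pull the "tie-ing to DB file" lines out of the debug output. The Python sketch below does that; the function name `bayes_db_files` is an assumption, and the pattern is based only on the debug lines shown in this message. It also copes with the path being wrapped onto the next line, as a mail client might do.

```python
import re

# Matches e.g. "tie-ing to DB file R/O /root/.spamassassin/bayes_toks";
# \s+ also spans a line break, in case the path was wrapped.
TIE_RE = re.compile(r"tie-ing to DB file (R/O|R/W)\s+(\S+)")


def bayes_db_files(debug_output):
    """Return unique (mode, path) pairs in the order they first appear."""
    seen = []
    for mode, path in TIE_RE.findall(debug_output):
        if (mode, path) not in seen:
            seen.append((mode, path))
    return seen
```

Run against the output above, this would list the /root files as opened read-only and the /home/spamd files as opened read-write, which at least makes it easy to see where the sync itself was written.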



Thanks for all the help! 

 

> -----Original Message-----
> From: Richard Ozer [mailto:[EMAIL PROTECTED] 
> Sent: 22 February 2005 15:19
> To: Paul J. Smith
> Cc: users@spamassassin.apache.org
> Subject: Re: Auto learning
> 
> Can you post your local.cf?
> 
> Paul J. Smith wrote:
> > Still nothing.  I set the owner on the bayes dbs to 'spamd', which is
> > the user the process is running under.  I also set og+rw.  Left
> > overnight, no change.  Only 2 hams, despite the autolearn having
> > picked loads of hams out of the feed with 'autolearn=spam/ham'.
> > I've just deleted the databases with 'sa-learn --clear', then a
> > 'sa-learn --sync', and reset the permissions again to spamd.  Still
> > nothing is getting added though, and I can't see any error messages,
> > even in debug mode.
> > 
> > The output from sa-learn --sync -D is
> > 
> > [EMAIL PROTECTED] .spamassassin]# sa-learn -D --sync
> > debug: SpamAssassin version 3.0.2


RE: Auto learning

2005-02-22 Thread Paul J. Smith
Thanks.  I am running 'sa-learn' as root.  But you've given me an idea.
Maybe it's looking in /home/spamd for them when running under that user
and in /root/.spamassassin when I'm running as root?  I've just
specified the path to bayes in local.cf, so we'll see if that makes any
difference.










RE: Auto learning

2005-02-22 Thread Andy Jezierski

"Paul J. Smith" <[EMAIL PROTECTED]>
wrote on 02/22/2005 01:41:28 AM:

> Still nothing.  I set the owner on the bayes dbs to 'spamd', which is
> the user the process is running under.  I also set og+rw.  Left
> overnight, no change.  Only 2 hams, despite the autolearn having
> picked loads of hams out of the feed with 'autolearn=spam/ham'.
> I've just deleted the databases with 'sa-learn --clear', then a
> 'sa-learn --sync', and reset the permissions again to spamd.  Still
> nothing is getting added though, and I can't see any error messages,
> even in debug mode.
> 
> The output from sa-learn --sync -D is
> 
> [EMAIL PROTECTED] .spamassassin]# sa-learn -D --sync

[snip]

> debug: bayes: 25498 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
> debug: bayes: 25498 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
> debug: bayes: found bayes db version 3
> debug: bayes: Not available for scanning, only 0 spam(s) in Bayes DB < 200

[snip]

> Can anyone see anything wrong with this?
> 
> I'm starting spamd with "-d -c -m5 -H -i 0.0.0.0 -A 192.168.0.0/24 -s local5"
> 
> Can't understand how I got 2 hams in there in the first place!
> 
> Thanks.

Are you sure you're using the correct bayes files?  In the debug above,
it shows the bayes files in /root/.spamassassin, yet you say that you're
running sa under the spamd userid.  On my system, my bayes files for the
spamd userid are in /home/spamd/.spamassassin.

May want to check that.

Andy

Re: Auto learning

2005-02-22 Thread Richard Ozer
-ing to DB file R/W /root/.spamassassin/bayes_toks
debug: bayes: 25498 tie-ing to DB file R/W /root/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: refresh: 25498 refresh /root/.spamassassin/bayes.lock
debug: Syncing complete.
debug: bayes: 25498 untie-ing
debug: bayes: 25498 untie-ing db_toks
debug: bayes: 25498 untie-ing db_seen
debug: bayes: files locked, now unlocking lock
debug: unlock: 25498 unlink /root/.spamassassin/bayes.lock
Can anyone see anything wrong with this?
I'm starting spamd with "-d -c -m5 -H -i 0.0.0.0 -A 192.168.0.0/24 -s local5"
Can't understand how I got 2 hams in there in the first place!
Thanks.
 

-----Original Message-----
From: Richard Ozer [mailto:[EMAIL PROTECTED] 
Sent: 21 February 2005 21:58
To: Paul J. Smith
Cc: users@spamassassin.apache.org
Subject: Re: Auto learning

I had a similar issue and noticed that my bayes database files did not have the proper 
owner or permissions.  That prevented auto learning from functioning.

RO
Paul J. Smith wrote:
Still setting up spamassassin.  I've got it running and auto learning is 
enabled.  It's been running all yesterday and over night.  I can see it 
has tried to auto learn a lot of ham/spam and I've fed it a load of spam 
as well.  Bayes doesn't seem to have kicked in though and if I do a 
sa-learn --sync -D I can see there are only 2 hams in there

debug: bayes: 6344 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
debug: bayes: 6344 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: bayes: Not available for scanning, only 2 ham(s) in Bayes DB < 200
debug: bayes: 6344 untie-ing
debug: bayes: 6344 untie-ing db_toks
debug: bayes: 6344 untie-ing db_seen
debug: Score set 0 chosen.
debug: Initialising learner
It's definitely autolearned far more than this.  Does it not show here?  
Do I just have to wait longer or are they being stored somewhere waiting 
for me to sa-learn them?  I'm using spamd 3.0.2 remotely.

Thanks.




RE: Auto learning

2005-02-22 Thread Paul J. Smith
s
debug: bayes: 25498 tie-ing to DB file R/W /root/.spamassassin/bayes_seen
debug: bayes: found bayes db version 3
debug: refresh: 25498 refresh /root/.spamassassin/bayes.lock
debug: Syncing complete.
debug: bayes: 25498 untie-ing
debug: bayes: 25498 untie-ing db_toks
debug: bayes: 25498 untie-ing db_seen
debug: bayes: files locked, now unlocking lock
debug: unlock: 25498 unlink /root/.spamassassin/bayes.lock

Can anyone see anything wrong with this?

I'm starting spamd with "-d -c -m5 -H -i 0.0.0.0 -A 192.168.0.0/24 -s local5"

Can't understand how I got 2 hams in there in the first place!

Thanks.


 

-----Original Message-----
From: Richard Ozer [mailto:[EMAIL PROTECTED] 
Sent: 21 February 2005 21:58
To: Paul J. Smith
Cc: users@spamassassin.apache.org
Subject: Re: Auto learning

I had a similar issue and noticed that my bayes database files did not have
the proper owner or permissions.  That prevented auto learning from functioning.

RO

Paul J. Smith wrote:
> Still setting up spamassassin.  I've got it running and auto learning is 
> enabled.  It's been running all yesterday and over night.  I can see it 
> has tried to auto learn a lot of ham/spam and I've fed it a load of spam 
> as well.  Bayes doesn't seem to have kicked in though and if I do a 
> sa-learn --sync -D I can see there are only 2 hams in there
>  
> debug: bayes: 6344 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
> debug: bayes: 6344 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
> debug: bayes: found bayes db version 3
> debug: bayes: Not available for scanning, only 2 ham(s) in Bayes DB < 200
> debug: bayes: 6344 untie-ing
> debug: bayes: 6344 untie-ing db_toks
> debug: bayes: 6344 untie-ing db_seen
> debug: Score set 0 chosen.
> debug: Initialising learner
>  
> It's definitely autolearned far more than this.  Does it not show here?  
> Do I just have to wait longer or are they being stored somewhere waiting 
> for me to sa-learn them?  I'm using spamd 3.0.2 remotely.
>  
> Thanks.


 

