
Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman




On 07/02/2014 11:12 AM, John Hardin wrote:


A week or so back they briefly listed some of the MailControl.com MTAs,
due to apparent exploits. They were quickly removed, though.


So the message here is that some DNSBL's are better than others about 
including and removing addresses quickly and responsibly. Perhaps. I 
take no position on that.


But that does not address the issue of collateral damage to users who 
share an ISP's email server with someone else who happened to get a spam 
through, which was then reported to the DNSBL.


Not long ago, I had another client blocked from sending response emails 
to their on-line customers about their purchases. Turned out one of the 
users on the hosting provider's system had sent some spam. Now the 
hosting provider (Webfaction) is quite responsible, very diligent, and 
has *fantastic* support. (I can recommend them for dynamic language 
apps with no reservations.) But guess what? The DNSBL's 
interface for interacting with them was down. For over a week. (We're 
sorry, but... Please come back when... No guaranty as to...) And emails 
to the affected customers were blocked for all that time.


I use DNSBL's. But I don't like them. SA is indispensable. I like it. 
But it's a huge compilation of kluges that happen to mostly work.


Expedient. Pragmatic. Not a real solution to the actual problem.

-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman




On 07/02/2014 11:10 AM, Jim Popovitch wrote:


Just a heads-up... that sort of biting comment is probably not welcome


I'm familiar with adapting to the relative insularities of various 
lists. But thanks for the heads-up, Jim.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread John Hardin


On Wed, 2 Jul 2014, Axb wrote:

If a sender's IP is listed @Spamhaus, he has a serious problem reaching 
many, many destinations. If he's been exploited, you get good evidence and 
fast delisting processing, and I have yet to see a real FP with ZEN.


A week or so back they briefly listed some of the MailControl.com MTAs, 
due to apparent exploits. They were quickly removed, though.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  There is no better measure of the unthinking contempt of the
  environmentalist movement for civilization than their call to
  turn off the lights and sit in the dark.         -- Sultan Knish
---
 2 days until the 238th anniversary of the Declaration of Independence

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Jim Popovitch

On Wed, Jul 2, 2014 at 11:54 AM, Steve Bergman  wrote:
>> I suggest you join the SDLU list where you can discuss anti spam
>> philosophy.
>>
>
> Thanks. I suggest that you consult for an ISP-dependent business someday.
> ;-)
>
> It's an education, too.
>
> -Steve


Just a heads-up... that sort of biting comment is probably not welcome
on the SDLU list.

-Jim P.

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman


I suggest you join the SDLU list where you can discuss anti spam
philosophy.



Thanks. I suggest that you consult for an ISP-dependent business 
someday. ;-)


It's an education, too.

-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb


On 07/02/2014 05:39 PM, Steve Bergman wrote:

On 07/02/2014 09:48 AM, Axb wrote:


If an IP is exploited/sends spam and a legitimate msg is rejected then
somebody hasn't done due diligence and I see the reject as legitimated.



The legitimate senders and receivers of the good message, neither of
whose companies has anything to do with the spam, would not see it
that way. And I agree with their perspective. Some of the perspectives
I'm reading here seem really off in the ether. I get the impression that
some are so frustrated with SA's limitations that they are willing to
resort to desperate measures which normal users would instantly
recognize as insane.

No rudeness intended. But some of the things I'm reading here are just
bizarre.


I suggest you join the SDLU list where you can discuss anti spam 
philosophy.


It's a great resource for knowledge.

List Guidelines: http://www.new-spam-l.com/admin/faq.html
List Information: https://spammers.dontlike.us/mailman/listinfo/list

The Mailop list is also a good place to lurk and bathe in hundreds of 
years of mail related experience


http://chilli.nosignal.org/mailman/listinfo/mailop

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman


On 07/02/2014 09:48 AM, Axb wrote:


If an IP is exploited/sends spam and a legitimate msg is rejected then
somebody hasn't done due diligence and I see the reject as legitimated.



The legitimate senders and receivers of the good message, neither of 
whose companies has anything to do with the spam, would not see it 
that way. And I agree with their perspective. Some of the perspectives 
I'm reading here seem really off in the ether. I get the impression that 
some are so frustrated with SA's limitations that they are willing to 
resort to desperate measures which normal users would instantly 
recognize as insane.


No rudeness intended. But some of the things I'm reading here are just 
bizarre.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb


On 07/02/2014 04:40 PM, Steve Bergman wrote:


You are discussing DNSBLs but not being specific.



I'm specific in that all the DNSBL's blacklist IP addresses or blocks.
And that in today's world many, many companies share sets of mail
servers with many other companies and individuals.


If an IP is exploited/sends spam and a legitimate msg is rejected then 
somebody hasn't done due diligence and I see the reject as legitimated.


If I need to open up, I have options such as DNSWL, etc.

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman





You are discussing DNSBLs but not being specific.



I'm specific in that all the DNSBL's blacklist IP addresses or blocks. 
And that in today's world many, many companies share sets of mail 
servers with many other companies and individuals.
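
For readers unfamiliar with the mechanics: a DNSBL is queried by IP address, which is why every sender behind a shared mail server is affected by a single listing. Below is a minimal Python sketch of how the lookup name is formed for an IPv4 address. The zone `dnsbl.example.org` and the function name are placeholders for illustration, not a real list or API.

```python
import ipaddress

def dnsbl_query_name(ip: str, zone: str = "dnsbl.example.org") -> str:
    """Build the DNS name that is looked up to test an IPv4 address against a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    # DNSBL lookups use the octets in reverse order, followed by the zone name.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

print(dnsbl_query_name("192.0.2.25"))
```

Every customer relaying through 192.0.2.25 maps to the same query name, so the list cannot tell the one spamming account apart from its neighbours.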



I'll let others sell you this Hoover.


No sale necessary. I continue to recognize the overall expediency of the 
DNSBL kluge, and continue to use it myself.


I wouldn't buy a Hoover anyway. I'm a Kirby kind of guy. I have a 1969 
Dual Sanitronic 80 that my grandmother gave our family new, as a 
Christmas gift.


https://c1.staticflickr.com/7/6071/6056367963_f06f08c7f6_z.jpg

A 1976 Classic III that I picked up at a garage sale.

http://cdn3.volusion.com/maxg3.xen6j/v/vspfiles/photos/KirbyClassicIII-4.jpg?1329982229

And a really cool model 516, manufactured in 1956 that someone had set 
out on the curb for garbage pickup, which I rescued and restored.


http://www.1377731.com/kirby/516_5.jpg

All stock photos. Not mine.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb


On 07/02/2014 03:54 PM, Steve Bergman wrote:



On 07/02/2014 06:45 AM, Axb wrote:


I'm pretty sure a huge number of SA users trust Spamhaus' ZEN at SMTP
level for outright rejects.


At this point, I'm using the defaults, other than upping BAYES_999
enough to total 5.0 when added to BAYES_99.



If a sender's IP is listed @Spamhaus , he has a serious problem reaching
many, many destinations.


Many, many destinations? Or a high percentage of destinations? I
recently had to explain to the owner of the company why an important
email from one of his business associates at another company was
blocked. I told him that they were on a couple of spam block lists
(which they were) and that contributed to the mail's rejection.

I made the same pitch. "This should affect their outgoing mail to many
sites, etc.". But I'm not sure I believe it. When I interact with people
who've had their emails rejected (often related to DNSBLs) I've been
listening for any mention of other mails of theirs to other companies
being blocked. But when the DNSBL rules in SA are the major contributors
to the rejecting, it seems that we are the only domain they interact
with which is doing so. Entries in the DNSBL databases do great
collateral damage.

And of course none of these companies are spammers. They're with this or
that ISP who has, at one time, had someone exploit their servers to send
spam.

DNSBL's are like a guy with a bazooka trying to play sniper.



You are discussing DNSBLs but not being specific.

With millions of sessions/day I'm glad Spamhaus keeps my servers from 
melting.


I'll let others sell you this Hoover.

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman




On 07/02/2014 06:45 AM, Axb wrote:


I'm pretty sure a huge number of SA users trust Spamhaus' ZEN at SMTP
level for outright rejects.


At this point, I'm using the defaults, other than upping BAYES_999 
enough to total 5.0 when added to BAYES_99.
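
The arithmetic behind that adjustment, as a small Python sketch. The 3.5 used for BAYES_99 is an assumed example value, not read from any configuration; substitute whatever score your installation actually assigns. The function name is hypothetical.

```python
def bayes_999_score_needed(bayes_99_score: float, target_total: float = 5.0) -> float:
    """Score BAYES_999 must carry so that BAYES_99 + BAYES_999 reaches target_total."""
    # Never return a negative score if BAYES_99 alone already meets the target.
    return max(0.0, target_total - bayes_99_score)

# Assumed BAYES_99 score of 3.5 (illustrative only).
print(bayes_999_score_needed(3.5))
```

The result would then go into the local configuration as a `score BAYES_999 ...` line.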




If a sender's IP is listed @Spamhaus , he has a serious problem reaching
many, many destinations.


Many, many destinations? Or a high percentage of destinations? I 
recently had to explain to the owner of the company why an important 
email from one of his business associates at another company was 
blocked. I told him that they were on a couple of spam block lists 
(which they were) and that contributed to the mail's rejection.


I made the same pitch. "This should affect their outgoing mail to many 
sites, etc.". But I'm not sure I believe it. When I interact with people 
who've had their emails rejected (often related to DNSBLs) I've been 
listening for any mention of other mails of theirs to other companies 
being blocked. But when the DNSBL rules in SA are the major contributors 
to the rejecting, it seems that we are the only domain they interact 
with which is doing so. Entries in the DNSBL databases do great 
collateral damage.


And of course none of these companies are spammers. They're with this or 
that ISP who has, at one time, had someone exploit their servers to send 
spam.


DNSBL's are like a guy with a bazooka trying to play sniper.

-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb


On 07/02/2014 10:47 AM, Steve Bergman wrote:


The DNSBL's are problematic because so many ISP's mail servers are on
them. We get quite a few emails from employees at companies whose ISPs
are on Spamhaus lists, or whatever, due to nothing that has anything to
do with them.


I'm pretty sure a huge number of SA users trust Spamhaus' ZEN at SMTP 
level for outright rejects.


If a sender's IP is listed @Spamhaus, he has a serious problem reaching 
many, many destinations. If he's been exploited, you get good evidence 
and fast delisting processing, and I have yet to see a real FP with ZEN.


Consider that it is better for a sender to get a hard reject than to have 
msgs land in some spam folder and remain unseen.


but then...

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb


On 07/02/2014 10:47 AM, Steve Bergman wrote:


But for all the discussion today, we never really had a good talk about
postscreen, which is something I'd like to hear someone expound a bit upon.


probably Wrong list ... review Postfix list archives

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb


On 07/02/2014 10:47 AM, Steve Bergman wrote:


I'll add you to the list of people telling me that jumping out of an
airplane at 20,000 feet with nothing but a parachute and a pair of
underwear is fun.


Yep... it is...
though you could catch a cold...

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman




On 07/02/2014 03:05 AM, Dave Funk wrote:


Unless you've explicitly disabled them, the network based rules (razor,
pyzor, dcc, DNS based rules, RBLs, URIBLs, etc) constitute an external
'reputation' system to pass judgment on messages.


Actually, DCC is not included in the default due to arbitrary 
restrictions on request volume for the public servers. 100,000 per day 
or something. And neither is Pyzor, presumably for similar reasons? 
Razor2 is in by default.


I use all these, but have reservations about them. DCC, Pyzor, and Razor2 
are lists of bulk email. Not specifically of *unsolicited* bulk email. 
Many of my users are on lists of various sorts.


The DNSBL's are problematic because so many ISP's mail servers are on 
them. We get quite a few emails from employees at companies whose ISPs 
are on Spamhaus lists, or whatever, due to nothing that has anything to 
do with them.




It's not uncommon to take a low-scoring spam and find that it gets a
higher score on retest as it has been added to various bad-boy lists.


Except that the "bad-boy" lists flag more ham than spam.



This is also one way that gray-listing helps.


Review the thread. You don't want to talk to me about greylisting. ;-)

But for all the discussion today, we never really had a good talk about 
postscreen, which is something I'd like to hear someone expound a bit upon.




I've used site-wide Bayes with auto-learning at a site with ~3000 users
and have had to flush & restart our Bayes database twice in 10 years.



I'll add you to the list of people telling me that jumping out of an 
airplane at 20,000 feet with nothing but a parachute and a pair of 
underwear is fun.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman




On 07/02/2014 02:39 AM, Dave Funk wrote:


Steve,
For some reason you seem to be hung-up on Bayes "autolearning".


Skip down the thread. I was demonstrated to be wrong. :-)



Is it possible that you're confusing it with "Auto-White listing"? (which is now
deprecated and has -nothing- to do with Bayes).


No. I know the difference. AWL, planned to be replaced with TxRep and 
all that. (I'd mention that TxRep has problems, but it's too late at 
night for me to engage in yet another argument.)




SA's Bayesian scorer is a system based upon a method that parses a
message, extracts 'tokens' from it and uses an algorithm to calculate a
score for the message based upon a dictionary of previously seen tokens
and their relative merit.


Yeah. Bayesian statistics is pretty cool.


or via an automated process from within SA as it scores messages
(known as 'auto' learning). So regardless of whether manual or auto
learning is utilized, tokens are added to the dictionary.


See, that's where things stop making sense to me. I would not expect the 
Bayesian filter to do any better than its training. And if its 
training is via input from static rules (plus DNSBL's and DCC's) I would 
not expect it to be able to do any better. And it's not hard to imagine 
pathological behavior developing. But people are telling me different. 
And I'm open to considering alternative possibilities.



It's also
possible to employ both auto & manual learning methods in the same
installation.


That would be the scenario I am considering.


There can be one dictionary used for scoring all messages processed (called
"site wide Bayes") or many separate dictionaries, one used for each
recognized user ("per user Bayes"). Either way, the dictionary(s) need to
be updated (and the update process could be either manual, auto, or both).


Yes. I've been devoted to individual fileDB's, each individually trained 
for a particular user's spam^Wemail stream. People are telling me that 
system-wide databases work well.



It's been this way for the past 10+ years AFAIK (well, maybe 10 years
ago it didn't have as many options for back-end database storage, mostly
limited to Berkeley-DB type methods).



I think it was around 2003, in SA 2.5(?) that SA got a Bayesian 
classifier. IIRC, there was a project called dspam (which I think is 
still around). For a while the dspam guys were pushing the fact that 
*dspam* was a modern spam filter, and SA was old, clunky, and too 
outdated to use.


Anyway, in the very early versions of SA Bayes, everything was 
system-wide. Later they added the option to use individual user files. 
And the only info I've seen that described autolearn and how it worked 
was a mailing list post from 2004 which specifically stated that it was 
system-wide, in memory, and was lost upon restart. Maybe that's correct 
and maybe it's not.


But today, it looks to be user-specific, if configured that way. I'm 
still working out whether I want to use it, and if so, how.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Dave Funk


On Wed, 2 Jul 2014, Steve Bergman wrote:

Well... I just turned on autolearn for a moment, deleted the bayes_* files on 
the test account I use, and sent myself a message from my usual outside 
account. And new bayes_* files were created. So I was wrong, and I win. More 
options.


So now I can proceed to the "what does this mean?" phase.

If I leave things as they are, then training is perfect if the users are 
diligent. But if they are not, then... what? I see plenty of spams getting 
through with a 0.0 score. IIRC, the autolearn spam threshold is 7? Pretty 
much everything there is spam.


But I'm not sure I quite buy having the static rules of SA training Bayes. 
Isn't Bayes just learning to emulate the static rules, with all their 
imperfections?


Unless you've explicitly disabled them, the network based rules (razor,
pyzor, dcc, DNS based rules, RBLs, URIBLs, etc) constitute an external
'reputation' system to pass judgment on messages.
It's not uncommon to take a low-scoring spam and find that it gets a
higher score on retest as it has been added to various bad-boy lists.

This is also one way that gray-listing helps. If you stiff-arm the first
pass of a spam run a later check may hit it more accurately as it's been
added to block-lists in the mean-time.
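
A rough Python sketch of the greylisting idea described above: the first delivery attempt for an unseen (client IP, sender, recipient) triplet is deferred, and a retry after a minimum delay is accepted, which gives block-lists time to catch up. The class name, the 300-second delay, and the addresses are illustrative assumptions, not taken from any particular greylisting daemon.

```python
from typing import Dict, Tuple

Triplet = Tuple[str, str, str]  # (client IP, envelope sender, envelope recipient)

class Greylist:
    def __init__(self, min_delay: float = 300.0) -> None:
        self.min_delay = min_delay  # seconds; arbitrary example value
        self.first_seen: Dict[Triplet, float] = {}

    def check(self, triplet: Triplet, now: float) -> str:
        """Return 'defer' (temporary failure) or 'accept' for a delivery attempt."""
        # Record the time of the first attempt; later attempts reuse that timestamp.
        first = self.first_seen.setdefault(triplet, now)
        return "accept" if now - first >= self.min_delay else "defer"

g = Greylist()
t = ("192.0.2.25", "sender@example.com", "user@example.net")
print(g.check(t, now=0.0), g.check(t, now=60.0), g.check(t, now=600.0))
```

A legitimate MTA retries and gets through on a later attempt; by then the content filter sees whatever listings were added in the meantime.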


If it starts going wrong, doesn't that mean the errors are going to spiral 
out of control?


That is a possible risk of relying solely on auto-learning.
The autolearn system has been carefully crafted and tuned over the years
to try to prevent a feed-back loop from throwing it into a tail-spin.
For example the internal scoring system used to determine if a message
is spam or ham WRT the choice for auto-learning explicitly excludes
the Bayes score (and other particular kinds of scores such as white/black
lists) to try to prevent tail-eating.
Occasional judicious manual learning can help to 'tweak' things when Bayes
looks like it's not in top shape. (IE manual learning of FPs & FNs).
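
The feedback-loop guard described above can be sketched in Python as follows. This is a simplified illustration, not SpamAssassin's actual code: the excluded prefixes, the example rule scores, and the threshold values are assumptions (in a real installation the thresholds come from the `bayes_auto_learn_threshold_nonspam` and `bayes_auto_learn_threshold_spam` settings), and SA applies further safeguards that are not modelled here.

```python
from typing import Dict, Optional

# Rule-name prefixes left out of the autolearn score (illustrative, not SA's actual list).
EXCLUDED_PREFIXES = ("BAYES_", "USER_IN_WHITELIST", "USER_IN_BLACKLIST")

def autolearn_decision(hits: Dict[str, float],
                       ham_threshold: float = 0.1,
                       spam_threshold: float = 12.0) -> Optional[str]:
    """Return 'spam', 'ham', or None (do not learn) from the non-excluded rule hits."""
    learn_score = sum(score for rule, score in hits.items()
                      if not rule.startswith(EXCLUDED_PREFIXES))
    if learn_score >= spam_threshold:
        return "spam"
    if learn_score <= ham_threshold:
        return "ham"
    return None

# Example scores are made up; BAYES_99 is ignored when deciding whether to learn.
hits = {"BAYES_99": 3.5, "URIBL_BLACK": 1.7, "RAZOR2_CHECK": 0.9}
print(autolearn_decision(hits))
```

Because the Bayes hit is left out of the sum, Bayes cannot talk itself into learning a message; only the other rules can trigger training.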

I've used site-wide Bayes with auto-learning at a site with ~3000 users
and have had to flush & restart our Bayes database twice in 10 years.

Dave

--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman




On 07/02/2014 02:14 AM, Axb wrote:


You don't need to trust me or believe me (I'm not selling anything -
just commenting on what works for me)


Well, I know you know what I meant.


Ever thought of running a newer distro in a VM, only for SA and let
spamass-milter use that?
That would mean you can play with SA 3.4 without having to redo all your
mail infra?



I'm pushing to do our Ubuntu 14.04 upgrade soon to get the Dovecot full 
text search. And then a memory upgrade. And these days I just max them 
out on memory. 4GB -> 32GB. Plus adding a 4TB RAID1.


So it ought to be able to handle almost anything. And I've just 
confirmed that SA 3.4 made it into 14.04.


That should, at least, avert all those annoying "time to upgrade" 
responses like I got here earlier.


It's very late here. 2:45AM, I see. But it's been a lot of fun arguing 
with you guys today. And thanks for all the help. Pyzor seems to be 
functioning fine now.


General rules of thumb to keep in mind:

Whenever there are inexplicable problems, it's probably SELinux causing 
them. And if not that, regular old POSIX permissions.


And if ever there is an article of clothing you need but can't find 
anywhere in the house, there's usually a dog sleeping on it. Or possibly 
a cat.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Dave Funk


On Wed, 2 Jul 2014, Steve Bergman wrote:


On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote:


Those do not tell you about using file or SQL based databases?


They do. But not specifically with respect to autolearn.

You never thought about googling for "spamassassin per user" and friends?
You never checked the SA wiki?


I have, indeed. No reference to autolearn and persistent storage. The lack of 
mention is notable.


I'd expect people to be lining up to tell me I'm mistaken if I absolutely 
were.


Can you point me to a change log somewhere documenting autolearn moving from 
in-memory and system-wide to per user and persistent?


I don't hold a strong opinion on this. It would be nice if I were wrong. It 
would open more options.


I'm just waiting for evidence that it's the case. My perception is that it's 
not.


-Steve


Steve,
For some reason you seem to be hung-up on Bayes "autolearning". Is it
possible that you're confusing it with "Auto-White listing"? (which is now
deprecated and has -nothing- to do with Bayes).

SA's Bayesian scorer is a system based upon a method that parses a
message, extracts 'tokens' from it and uses an algorithm to calculate a
score for the message based upon a dictionary of previously seen tokens
and their relative merit.

The dictionary is created and updated by a process called 'learning'
wherein already-classified messages are tokenized and their tokens are
stored in the dictionary along with a merit value derived from their
instance count and a factor taken from being classified as spam or ham.
This learning process can be either externally driven (known as 'manual'
learning) or via an automated process from within SA as it scores messages
(known as 'auto' learning). So regardless of whether manual or auto
learning is utilized, tokens are added to the dictionary. It's also
possible to employ both auto & manual learning methods in the same
installation.
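
A toy Python sketch of the learning process just described. It is not SA's tokenizer or its probability combiner; the class, the smoothing, and the averaging are simplifications chosen for illustration. The point it shows is that manual and auto learning feed the same dictionary through the same `learn` step.

```python
import re
from collections import Counter

class TokenDictionary:
    """Toy token dictionary: per-token spam/ham counts plus message totals."""

    def __init__(self) -> None:
        self.spam_tokens: Counter = Counter()
        self.ham_tokens: Counter = Counter()
        self.nspam = 0
        self.nham = 0

    @staticmethod
    def tokenize(text: str) -> set:
        # Lowercased words; each token counts once per message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def learn(self, text: str, is_spam: bool) -> None:
        """Same code path whether a human or an automatic process supplies the label."""
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def token_spamminess(self, token: str) -> float:
        """Smoothed spam-vs-ham ratio for a token; 0.5 means no evidence either way."""
        s = (self.spam_tokens[token] + 1) / (self.nspam + 2)
        h = (self.ham_tokens[token] + 1) / (self.nham + 2)
        return s / (s + h)

    def score(self, text: str) -> float:
        """Average spamminess of the tokens in a message (a crude combiner)."""
        tokens = self.tokenize(text)
        if not tokens:
            return 0.5
        return sum(self.token_spamminess(t) for t in tokens) / len(tokens)

d = TokenDictionary()
d.learn("cheap pills buy now", is_spam=True)         # e.g. a manual report
d.learn("meeting agenda for monday", is_spam=False)  # e.g. an automatic decision
print(d.score("buy cheap pills") > d.score("monday meeting"))
```

Whether there is one such dictionary for the whole site or one per user is a separate choice from how it gets trained.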

There can be one dictionary used for scoring all messages processed (called
"site wide Bayes") or many separate dictionaries, one used for each
recognized user ("per user Bayes"). Either way, the dictionary(s) need to
be updated (and the update process could be either manual, auto, or both).

The Bayes dictionary(s) need to be stored some how, the usual method is
via some kind of database. It could be a simple file based DB, some kind
of fancy SQL server based system or something else. This is a DBA'ish kind
of choice as to what particular technology is used to store the
dictionary DB. (usually on disk in some way, could be in some kind of
memory resident set of tables, or something else???).

So you have a multi-dimensional matrix WRT your Bayes system
configuration, and manual VS auto learning is just one factor.

It's been this way for the past 10+ years AFAIK (well, maybe 10 years
ago it didn't have as many options for back-end database storage, mostly
limited to Berkeley-DB type methods).

I hope this helps you.


--
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman




On 07/02/2014 02:02 AM, Axb wrote:


and don't count on that - they may do it the first week, new toy,
but for how long?


Not new. They'd previously been training SA with Evolution for some 
years. I have some confidence in many of them doing it right.




Also: keep in mind each user's Bayes folder also gets a bayes_seen file
which grows and grows and grows and never gets truncated.


Well, I have the maximum bayes toks set at 2,000,000. Is bayes_seen 
likely to become a problem with ~100 users and 4TB of disk space?


My largest email volume user has accumulated only 320k of "seen" in 10 
days. And I assume that repeat spams don't add to it.




Do you really want to spend time watching each user's Bayes?


Not really. But I'll do whatever is necessary.

-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb


On 07/02/2014 09:01 AM, Steve Bergman wrote:

Axb,

I'm not sure I quite believe it. And I'm not quite sure I trust you. But
you do make an attractive pitch. Excellent spam filtering, system-wide,
with no responsibility for training on the part of the users?


You don't need to trust me or believe me (I'm not selling anything - 
just commenting on what works for me)


You can try it and after a couple of weeks, see if it works for you and 
then if necessary come up with new methods for extra training or dump 
the concept totally.


Bayes is yet another scoring mechanism in SA. If you have enough 
traffic, you can wipe the data any time and it's not like you're 
switching SA off totally.


During the dev/test process of the Redis backend, as stuff changed on a 
daily basis I was forced to purge the Bayes data several times/week.

It even became a running joke (wave Henrik/Marc).


This sounds like the kind of "too good to be true" message that I'd
expect to receive in a spam mail.


:-)



But hmm. This is good dream material for tonight. I wonder if our Ubuntu
14.04 upgrade has SA 3.4 with redis built in. I do hear that the redis
backend is amazing.


Ever thought of running a newer distro in a VM, only for SA and let 
spamass-milter use that?
That would mean you can play with SA 3.4 without having to redo all your 
mail infra?

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Axb


On 07/02/2014 08:48 AM, Steve Bergman wrote:

Someone, please convince me that I should turn it on.


autolearn doesn't mean you cannot also train manually...


Should I turn it on and take my "train as ham" entry out of .forward? Or
should I not?


manually training ham from unreviewed data?
bad idea.


I suppose that largely depends upon my individual users' levels of
diligence.


and don't count on that - they may do it the first week, new toy, 
but for how long?


Also: keep in mind each user's Bayes folder also gets a bayes_seen file 
which grows and grows and grows and never gets truncated.


Do you really want to spend time watching each user's Bayes?

Re: Bayes, Manual and Auto Learning Strategies

2014-07-02 Thread Steve Bergman


Axb,

I'm not sure I quite believe it. And I'm not quite sure I trust you. But 
you do make an attractive pitch. Excellent spam filtering, system-wide, 
with no responsibility for training on the part of the users?


This sounds like the kind of "too good to be true" message that I'd 
expect to receive in a spam mail.


But hmm. This is good dream material for tonight. I wonder if our Ubuntu 
14.04 upgrade has SA 3.4 with redis built in. I do hear that the redis 
backend is amazing.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman

Well... I just turned on autolearn for a moment, deleted the bayes_* 
files on the test account I use, and sent myself a message from my usual 
outside account. And new bayes_* files were created. So I was wrong, and 
I win. More options.


So now I can proceed to the "what does this mean?" phase.

If I leave things as they are, then training is perfect if the users are 
diligent. But if they are not, then... what? I see plenty of spams 
getting through with a 0.0 score. IIRC, the autolearn spam threshold is 
7? Pretty much everything there is spam.


But I'm not sure I quite buy having the static rules of SA training 
Bayes. Isn't Bayes just learning to emulate the static rules, with all 
their imperfections?


If it starts going wrong, doesn't that mean the errors are going to 
spiral out of control?


Leaving autolearn off puts everything in the hands of the users. And 
that's where I've left things for now.


Someone, please convince me that I should turn it on.

Should I turn it on and take my "train as ham" entry out of .forward? Or 
should I not?


I suppose that largely depends upon my individual users' levels of 
diligence.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Axb


On 07/02/2014 08:00 AM, Steve Bergman wrote:



On 07/02/2014 12:52 AM, Axb wrote:

Site wide bayes works VERY well even under such ugly conditions as
traffic with multiple languages, for ham as well as spam.


Please tell me more.

This goes against Paul Graham's original advice, IIRC. And it goes
against intuition. Then again, Bayesian statistics go against intuition.

It's hard to let go and trust a system-wide Bayes. But I'm listening...


It works, trust me. SA's Bayes implementation is incredibly robust.


My site wide Bayes DB is not exactly small.

0.000  0   23850755  0  non-token data: nspam
0.000  0   10702302  0  non-token data: nham

Would I run a monster this size if it didn't work? Nope.
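
For anyone wanting to track these counters over time, here is a minimal Python sketch that pulls the `nspam` and `nham` values out of lines shaped like the two quoted above. The function name is hypothetical, and the column position is inferred only from those two lines, so treat it as an assumption about the output layout.

```python
def parse_dump_magic(output: str) -> dict:
    """Extract the 'non-token data' counters from `sa-learn --dump magic` style lines."""
    counters = {}
    for line in output.splitlines():
        if "non-token data:" not in line:
            continue
        fields, name = line.split("non-token data:", 1)
        parts = fields.split()
        # The third column holds the value in the lines quoted above.
        if len(parts) >= 3 and parts[2].isdigit():
            counters[name.strip()] = int(parts[2])
    return counters

sample = """\
0.000  0   23850755  0  non-token data: nspam
0.000  0   10702302  0  non-token data: nham
"""
counts = parse_dump_magic(sample)
total = counts["nspam"] + counts["nham"]
print(counts, f"spam share: {counts['nspam'] / total:.1%}")
```

Logging these numbers periodically gives a rough view of how fast autolearn is feeding the database and in what spam/ham proportion.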

I waited a long time to be able to use something really 100% site wide 
(not per server) until we got the ability to use Redis, which is FAST and 
robust and doesn't cause me headaches such as SQL, file permission issues, etc.


I can't give you a scientific reason for not using per-user Bayes.
Site wide works for my 2000+ corp domains, which include .tr, .ru, .cn, 
.ua, .es, .fr, .de plus a ton of other major ccTLD domains.


AND: I only run autolearn. NO manual/scheduled training.

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread John Hardin


On Wed, 2 Jul 2014, Steve Bergman wrote:




On 07/01/2014 11:14 PM, John Hardin wrote:


 Autolearn trains the bayes database. The bayes data is stored wherever
 you configured it to be stored, in a DBM database or SQL or redis, and
 it's per-user if you configure per-user Bayes databases and scan emails
 using different usernames (vs. a global user like root or amavis).


That is interesting. How sure are you of this? Because if you're pretty sure, 
it's a piece of information I've been keen to confirm for a while.


The bayes database is the only thing in SA that can be trained. (I'm 
excluding submission of the message to pyzor et al. because that's 
obviously not local.)


Odd, though, that before I set up .forward to train incoming mails as ham and 
disabled autolearn, no nhams were showing up in "sa-learn --dump magic" for 
the individual users. Just nspams.


That is rather odd. Very-low-scoring hams should be autolearned as ham 
unless the default thresholds have been changed.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  News flash: Lowest Common Denominator down 50 points
---
 3 days until the 238th anniversary of the Declaration of Independence

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman




On 07/02/2014 12:52 AM, Axb wrote:

Site wide bayes works VERY well even under such ugly conditions as
traffic with multiple languages, for ham as well as spam.


Please tell me more.

This goes against Paul Graham's original advice, IIRC. And it goes 
against intuition. Then again, Bayesian statistics go against intuition.


It's hard to let go and trust a system-wide Bayes. But I'm listening...

-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Axb


On 07/02/2014 07:37 AM, Steve Bergman wrote:



Let's turn this around. Can you prove autolearn was ever done to memory?


I'm not really interested in proving anything. I'm interested in being
convinced that autolearn is individual file-based when spamc is run as
the individual user.


It's in the code... but yes, autolearn is always file based and respects 
the per-user settings unless you run spamd with -x.



> I'm not quite sure how that would affect my strategy. But it might (or
> might not) make autolearn useful.


More important, you may need to reconsider whether per-user Bayes will 
give you the level of quality you're aiming for, and from experience I 
can tell you: it won't.


Site wide bayes works VERY well even under such ugly conditions as 
traffic with multiple languages, for ham as well as spam.

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman




> Let's turn this around: can you prove autolearn was ever done to memory?


I'm not really interested in proving anything. I'm interested in being 
convinced that autolearn is individual file-based when spamc is run as 
the individual user.


I'm not quite sure how that would affect my strategy. But it might (or 
might not) make autolearn useful.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Axb


On 07/02/2014 07:19 AM, Steve Bergman wrote:



> On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote:
>
> > Those do not tell you about using file or SQL based databases?
>
> They do. But not specifically with respect to autolearn.
>
> > You never thought about googling for "spamassassin per user" and
> > friends? You never checked the SA wiki?
>
> I have, indeed. No reference to autolearn and persistent storage. The
> lack of mention is notable.
>
> I'd expect people to be lining up to tell me I'm mistaken if I
> absolutely were.
>
> Can you point me to a change log somewhere documenting autolearn moving
> from in-memory and system-wide to per user and persistent?
>
> I don't hold a strong opinion on this. It would be nice if I were wrong.
> It would open more options.
>
> I'm just waiting for evidence that it's the case. My perception is that
> it's not.


Let's turn this around: can you prove autolearn was ever done to memory?

If you mean "autolearn to journal", this is also file based.

I've been using SA since before it was an Apache project, when it was 
developed by McAfee and the sources were on Sourceforge and back then it 
was already file based.

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman




On 07/01/2014 11:14 PM, John Hardin wrote:


> Autolearn trains the bayes database. The bayes data is stored wherever
> you configured it to be stored, in a DBM database or SQL or redis, and
> it's per-user if you configure per-user Bayes databases and scan emails
> using different usernames (vs. a global user like root or amavis).



That is interesting. How sure are you of this? Because if you're pretty 
sure, it's a piece of information I've been keen to confirm for a while.


Odd, though, that before I set up .forward to train incoming mails as 
ham and disabled autolearn, no nhams were showing up in "sa-learn --dump 
magic" for the individual users. Just nspams.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman




On 07/01/2014 11:49 PM, Karsten Bräckelmann wrote:


> Those do not tell you about using file or SQL based databases?


They do. But not specifically with respect to autolearn.

> You never thought about googling for "spamassassin per user" and
> friends? You never checked the SA wiki?


I have, indeed. No reference to autolearn and persistent storage. The 
lack of mention is notable.


I'd expect people to be lining up to tell me I'm mistaken if I 
absolutely were.


Can you point me to a change log somewhere documenting autolearn moving 
from in-memory and system-wide to per user and persistent?


I don't hold a strong opinion on this. It would be nice if I were wrong. 
It would open more options.


I'm just waiting for evidence that it's the case. My perception is that 
it's not.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Karsten Bräckelmann

On Tue, 2014-07-01 at 22:40 -0500, Steve Bergman wrote:
> On 07/01/2014 10:21 PM, Karsten Bräckelmann wrote:
> >
> > http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html
> > http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html
> 
> I've read those over and over. It never says anything about where the 
> data is maintained, or whether it's per-user or not. The *only* solid 
> claim I have is a ten year old (yes, at the dawn of SA Bayes) post which 
> specifically says it's in memory, system-wide, and lost upon SA restart.

Those do not tell you about using file or SQL based databases? You never
thought about googling for "spamassassin per user" and friends? You
never checked the SA wiki?

FWIW, the links given do NOT refer to in-memory only at all.

An in-memory only Bayes database would definitely date back much more than
ten years, if it ever existed. No need for me to even check.

> > Milter usually means system-wide. (But since you just asked, it is.)
> 
> I'm using spamass-milter. It suid's to the recipient user for most 
> mails. For aliases it defaults to a particular user who gets an 
> unbelievable amount of spam at the gate, and whom I know sorts his 
> ham/spam religiously.

So you want to check back with your specific setup and its docs.
Suid'ing is pretty likely to be per-user, though the definition of user
is not specifically clear in the context of a milter (and the final
recipient).

In either case, that is not SA specific. (SA happily uses both, per-user
or site-wide config AND bayes database, depending on context.) Refer to
your milter's docs.

> > Irrespective of your feeling -- cheers!  /me having a beer
> 
> Whew! After the conversations I've had here, today, I need one, too! ;-)

Don't see this as an attack on you. It isn't. Just pointers on helping
your understanding of the situation and your issues. Not always gentle,
but that also reflects the initial stance.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Karsten Bräckelmann

On Tue, 2014-07-01 at 22:18 -0500, Steve Bergman wrote:
> On 07/01/2014 09:53 PM, Karsten Bräckelmann wrote:
> 
> > Frankly, it appears you don't understand what auto-learning is.
> 
> So please specify, explicitly, what it is. I asked some specific 
> questions about it. And I'm very interested in the answers.

If you want my opinion, please re-phrase your questions. I locally
deleted most of this previous (originally unrelated) thread.

> Is auto-learn still system-wide? I'd need it to apply to individual 
> users. Is it in-memory only? Or can I have it update the users' filedb 
> token databases?

SA itself was never inherently system-wide, nor user-specific. It can be
either, or both. It depends on the context in which SA is called.

> If it's now per user and uses the user databases, then I am more than 
> ready to reconsider my opinion. But I've not been able to get a clear 
> answer to this. I haven't had an opportunity to test. And I'd want 
> confirmation from someone in the know anyway, before I changed strategies.

It does not depend on SA, but on how you invoke SA. We cannot give you a
clear answer. It depends on your system, your SMTP, glue, system wide
calling of SA, and possibly per-user invocations even after system-wide.

To be clear: SA is a filter. It does nothing itself, other than
classification. Being called, and at which point, is outside the scope
of SA. Rejecting, deleting, delivering or any other kind of action is
outside the scope of SA. That's actions performed by the calling layer,
based on the result of SA evaluation.

> >> This method shields the user from the worst of the spam, while giving
> >> them full control of what gets relearned as spam.
> >
> > Wrong. It is not "this" (your) method, that shields the user from the
> > worst of the spam. That's SA. Not your style of auto-training.
> 
> Mine is not autotraining at all. It's giving the user a way of 
> explicitly training the backend spam filter.

Quoting your previous post, you "have a line in the users' default
.forward file to train incoming mail as ham". That is auto-training.

> > (Besides, you *are* doing auto-learning, which you just claimed to be a
> > complete joke.)
> 
> No. The messages are assumed ham until the user classifies them as spam. 
> It is explicit learning. Under user control.

Being "assumed" is not the same as being "treated and automatically
reinforced". The latter is what you do. (And btw, Yes. You are
auto-learning.)

> > At this point I won't get into details. It should suffice to highlight
> > that a default ham auto-learning threshold of 0.1 is part of the safety
> > concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)
> 
> I really don't think you understand what it is I'm doing. Anything below 
> a score of 5.0 goes into their mailbox and is learned as ham. If it's ham, 
> that's great. If it's spam, they move it to Junk and it gets learned as 
> spam. Auto-learn is as brain dead as the defunct AWL.

I perfectly understood what you are doing.

You didn't understand why that is bad. Failing to explain might be my
bad, though I'll leave re-explaining for tomorrow, my timezone. Or for you
carefully re-reading my posts.

> > I never checked the TB internal Bayes implementation and auto-learn
> > strategy, but I'd be surprised if they do train on black/white, without
> > any gray area in between.
> 
> Optimally, I would have an "incoming folder" and then the user could 
> manually move the messages from there to spam or ham. But considering 

Which is basically what you came from, using Dovecot antispam plugin
with SA, and dedicated folders "where the user could manually move the
messages" to. Why didn't you just set that up?

(Hint: That's your set-up without auto-learning ham Inbox deliveries.)

> that this was not even remotely necessary with our old email provider, I 
> don't feel that I can put my users to that level of extra trouble that 
> they never even thought about having to deal with before, just because 
> SA is not performing as well as the spam filter they are used to. The 

Do initial manual training. Then get back to us.

> mail needs to go into the inbox directly. And for SA's Bayesian filter to work, 
> it needs to be assumed as ham initially.

No.

It seems your previous "email provider", whatever that might be, had
some sort of spam filtering service. Now you're on your own.

Which you are, unless you decide to ask for free (as in beer) support by
the community providing the software for free (as in speech) to help you
weed out the spam. You did ask, which is just fine, but your assumptions
are kind of hostile. Like your previous "email provider" would not use
SA internally. He most likely does.

> The only thing I see which might change my view would be explicit 
> details about where autolearn stores its data and how it is used on a 
> per user basis.

So the only thing that might change your view would be reading the docs.
Go read them.

Auto-learn stores its data exactly wher

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread John Hardin


On Tue, 1 Jul 2014, Steve Bergman wrote:




> On 07/01/2014 10:21 PM, Karsten Bräckelmann wrote:
>
> > http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html
> > http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html
>
> I've read those over and over. It never says anything about where the data is
> maintained, or whether it's per-user or not. The *only* solid claim I have is
> a ten year old (yes, at the dawn of SA Bayes) post which specifically says
> it's in memory, system-wide, and lost upon SA restart.


Autolearn trains the bayes database. The bayes data is stored wherever you 
configured it to be stored, in a DBM database or SQL or redis, and it's 
per-user if you configure per-user Bayes databases and scan emails using 
different usernames (vs. a global user like root or amavis).
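
To make the per-user versus site-wide distinction concrete, here is a small Python sketch. It assumes the DBM backend and SpamAssassin's documented default of bayes_path ~/.spamassassin/bayes, from which the bayes_toks, bayes_seen and bayes_journal files are derived. The helper bayes_files and the example home directories and site-wide path are hypothetical.

```python
import posixpath

# Documented SpamAssassin default; "~" refers to the home of the user SA runs as.
DEFAULT_BAYES_PATH = "~/.spamassassin/bayes"


def bayes_files(home, bayes_path=DEFAULT_BAYES_PATH):
    """Return the DBM file names derived from bayes_path for one user.

    Hypothetical helper for illustration; it only builds path strings.
    """
    if bayes_path.startswith("~/"):
        prefix = posixpath.join(home, bayes_path[2:])
    else:
        prefix = bayes_path  # absolute path: shared by every user
    return [prefix + "_" + suffix for suffix in ("toks", "seen", "journal")]


if __name__ == "__main__":
    # Per-user: each user gets separate on-disk files.
    print(bayes_files("/home/alice"))
    print(bayes_files("/home/bob"))
    # Site-wide (example path, an assumption): the same files for everyone.
    print(bayes_files("/home/alice", "/var/spamassassin/bayes"))
```

Either way the data lands in files (or in SQL/redis if so configured), so whatever autolearn trains should survive a restart.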


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  News flash: Lowest Common Denominator down 50 points
---
 3 days until the 238th anniversary of the Declaration of Independence

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman




On 07/01/2014 10:21 PM, Karsten Bräckelmann wrote:


> http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html
> http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html


I've read those over and over. It never says anything about where the 
data is maintained, or whether it's per-user or not. The *only* solid 
claim I have is a ten year old (yes, at the dawn of SA Bayes) post which 
specifically says it's in memory, system-wide, and lost upon SA restart.



> Milter usually means system-wide. (But since you just asked, it is.)


I'm using spamass-milter. It suid's to the recipient user for most 
mails. For aliases it defaults to a particular user who gets an 
unbelievable amount of spam at the gate, and whom I know sorts his 
ham/spam religiously.




> Which, referring to my previous post, also means, a single sloppy user
> deleting your custom-auto-learned FN ham messages affects all your other
> users.


No. I make sure to keep each user solely responsible for their own email 
welfare.



> Irrespective of your feeling -- cheers!  /me having a beer


Whew! After the conversations I've had here, today, I need one, too! ;-)


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Karsten Bräckelmann

On Tue, 2014-07-01 at 20:53 -0500, Steve Bergman wrote:
> On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:
> 
> > That's pretty bad practice. Fundamentally, you are implementing a custom
> > auto-learn flavor, overruling the SA configurable auto-learn behavior
> 
> BTW, that reminds me of a question I had been meaning to ask on the 
> list. Autolearn. There's very little written about it, so far as I am 

http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html
http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html

> aware. But from what I have gleaned, from old posts, is that it is 
> system-wide and in-memory.

It depends on how you call SA (SMTP or MDA level). SA itself is a
filter, called by your mail-processing chain. Thus, there is no SA
default context of system-wide or per-user. It depends on how you call
it.

> Now, I have Spamass-milter set to run SA 3.3 
> as the recipient user, using the filedb backend. So in 3.3, is autolearn 
> system wide and in memory, or per user and on disk?

Milter usually means system-wide. (But since you just asked, it is.)

Which, referring to my previous post, also means, a single sloppy user
deleting your custom-auto-learned FN ham messages affects all your other
users. Or a non-sloppy, but on-vacation-mode user.

Moreover, there is no in-memory-only, not-on-disk mode. Not unless you set
one up yourself, in which case you would not have to ask about it.

> This makes a difference regarding what Karsten and I are discussing. I 
> don't suppose I would object to being wrong. But I have a feeling that 
> I'm right.

Irrespective of your feeling -- cheers!  /me having a beer


Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman




On 07/01/2014 09:53 PM, Karsten Bräckelmann wrote:


> Frankly, it appears you don't understand what auto-learning is.


So please specify, explicitly, what it is. I asked some specific 
questions about it. And I'm very interested in the answers.


Is auto-learn still system-wide? I'd need it to apply to individual 
users. Is it in-memory only? Or can I have it update the users' filedb 
token databases?


If it's now per user and uses the user databases, then I am more than 
ready to reconsider my opinion. But I've not been able to get a clear 
answer to this. I haven't had an opportunity to test. And I'd want 
confirmation from someone in the know anyway, before I changed strategies.





> > This method shields the user from the worst of the spam, while giving
> > them full control of what gets relearned as spam.
>
> Wrong. It is not "this" (your) method, that shields the user from the
> worst of the spam. That's SA. Not your style of auto-training.



Mine is not autotraining at all. It's giving the user a way of 
explicitly training the backend spam filter.



> And unless you disabled Bayes auto-learning in SA (dunno, might have
> been mentioned deep in the thread), the user does not have full control
> of what gets relearned as spam.



I have disabled autolearning. I thought I mentioned that to you.



> (Besides, you *are* doing auto-learning, which you just claimed to be a
> complete joke.)


No. The messages are assumed ham until the user classifies them as spam. 
It is explicit learning. Under user control.




> At this point I won't get into details. It should suffice to highlight
> that a default ham auto-learning threshold of 0.1 is part of the safety
> concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)



I really don't think you understand what it is I'm doing. Anything below 
a score of 5.0 goes into their mailbox and is learned as ham. If it's ham, 
that's great. If it's spam, they move it to Junk and it gets learned as 
spam. Auto-learn is as brain dead as the defunct AWL.




> I never checked the TB internal Bayes implementation and auto-learn
> strategy, but I'd be surprised if they do train on black/white, without
> any gray area in between.


Optimally, I would have an "incoming folder" and then the user could 
manually move the messages from there to spam or ham. But considering 
that this was not even remotely necessary with our old email provider, I 
don't feel that I can put my users to that level of extra trouble that 
they never even thought about having to deal with before, just because 
SA is not performing as well as the spam filter they are used to. The 
mail needs to go into the inbox directly. And for SA's Bayesian filter to work, 
it needs to be assumed as ham initially.


The only thing I see which might change my view would be explicit 
details about where autolearn stores its data and how it is used on a 
per user basis.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Karsten Bräckelmann

On Tue, 2014-07-01 at 20:36 -0500, Steve Bergman wrote:
> On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:
> >
> > That's pretty bad practice. Fundamentally, you are implementing a custom
> > auto-learn flavor, overruling the SA configurable auto-learn behavior
> 
> SA's autolearn behavior doesn't make much sense. I have no confidence in it.

The auto-learning feature is NOT meant to be a fully automated training
system. It's an aid for the user to eliminate the need to care about the
extremes, while focusing on the close-calls. There are options to tweak
to your specific needs, and there is not even a single "SA autolearn
behavior" as you stated, but different flavors. And an option to turn it
off.

Frankly, it appears you don't understand what auto-learning is.

> This method shields the user from the worst of the spam, while giving 
> them full control of what gets relearned as spam.

Wrong. It is not "this" (your) method, that shields the user from the
worst of the spam. That's SA. Not your style of auto-training.

And unless you disabled Bayes auto-learning in SA (dunno, might have
been mentioned deep in the thread), the user does not have full control
of what gets relearned as spam.

> > and ignoring all safety concepts implemented by SA.
> 
> What safety concepts? autolearn is a complete joke. Even the docs 
> explain that it's only there as a last resort method of kinda sorta 
> training the spam filter.

You are doing (custom) auto-learning as ham of any message with a score
less than required_score of 5.0. *That* is a joke.

(Besides, you *are* doing auto-learning, which you just claimed to be a
complete joke.)

At this point I won't get into details. It should suffice to highlight
that a default ham auto-learning threshold of 0.1 is part of the safety
concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)

> > So if a user in a hurry simply deletes some spam, it will remain ham, as
> > far as Bayes is concerned.
> 
> Same as with Thunderbird, I think.

I never checked the TB internal Bayes implementation and auto-learn
strategy, but I'd be surprised if they do train on black/white, without
any gray area in between.

You stated it. Please back up your claim.

> And it's working very well for them. 
> If they act irresponsibly, they'll get more spam. It takes no longer to 
> highlight the spam and click "Junk" than it does to highlight the spam 
> and click "Delete".

While I am aware I'm not the average user -- there's a "delete" action
key on my keyboard. There's no "junk" equivalent. Yes, I avoid using the
mouse if keyboard interaction is more productive...

> I've pretty much decided at this point that if the users don't do what I 
> tell them to do, repeatedly, then what results is not my responsibility.
> 
> And it's not.

Do you hate your users or your job? (Sorry, snide remark I couldn't
resist. Feel free to ignore.)

> The alternative is to not mark incoming mail as ham, and allow the SA 
> Bayesian filter to remain inactive forever.

No. I can only guess, but it appears there are some mis-interpretations
in that conclusion.

The SA Bayesian classifier to "remain inactive forever" can only refer
to insufficient initial training. Manual training. Of at least 200 ham
and spam each (by default, you can lower that to 0). You will easily get
that by manual training of existing messages. And even default auto-
learning would eventually cross the ham number. Less than forever.
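
As a rough way to check whether a database has crossed that minimum, the counters printed by "sa-learn --dump magic" can be parsed. The Python sketch below assumes the usual column layout of that output and the documented defaults bayes_min_ham_num 200 and bayes_min_spam_num 200. The sample text and its counts are made up for illustration, and the helper names are hypothetical.

```python
import re

# Documented defaults for bayes_min_ham_num / bayes_min_spam_num.
MIN_HAM = 200
MIN_SPAM = 200

# Illustrative sample only; the counts are invented.
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        412          0  non-token data: nspam
0.000          0         57          0  non-token data: nham
0.000          0      48210          0  non-token data: ntokens
"""

LINE_RE = re.compile(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+)$")


def parse_magic(dump):
    """Extract the 'non-token data' counters into a dict of name -> int."""
    counters = {}
    for line in dump.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group(2).strip()] = int(match.group(1))
    return counters


def bayes_active(counters, min_ham=MIN_HAM, min_spam=MIN_SPAM):
    """True once both the ham and the spam counts reach their minimums."""
    return (counters.get("nham", 0) >= min_ham
            and counters.get("nspam", 0) >= min_spam)


if __name__ == "__main__":
    counts = parse_magic(SAMPLE_DUMP)
    print(counts, "Bayes active:", bayes_active(counts))
```

In the invented sample, plenty of spam has been learned but too little ham, so the classifier would not yet contribute a score.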

More importantly, SA still marks (classifies) incoming mail as ham. Just
because its overall score is less than 5.0. It just does not *learn* all
of them as ham. Because there's a chance it might not actually be ham,
but a FN.

That area, between (default) auto-learning as ham and classifying as
spam is the gray area, where actual user input is of much value. For
both, learning spam AND ham, for that matter. In particular, because
generally (and as SA principle), a FP is *much* worse than a FN.

Your approach of force-learning those as ham is biasing your Bayes DB.
At the very least temporarily (until a fresh spam campaign has been
re-trained by your users on Monday). At worst, until you clear it.
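
The bias can be sketched with a toy comparison in Python. The (score, is_spam) pairs below are hypothetical, not measurements, and the helper names are made up. The sketch only counts how many actual spam messages (false negatives) each ham-learning threshold would train as ham.

```python
# Hypothetical (score, actually_spam) pairs; invented for illustration only.
MESSAGES = [
    (-0.5, False), (0.0, False), (1.8, False), (2.9, True),
    (3.7, True), (4.6, True), (4.9, False), (6.2, True), (14.0, True),
]


def learned_as_ham(messages, ham_threshold):
    """Messages a strategy trains as ham: everything scoring below the threshold."""
    return [(score, spam) for score, spam in messages if score < ham_threshold]


def mislearned_spam(messages, ham_threshold):
    """Count actual spam (false negatives) that would be trained as ham."""
    return sum(1 for _, spam in learned_as_ham(messages, ham_threshold) if spam)


if __name__ == "__main__":
    for threshold in (0.1, 5.0):  # default ham threshold vs. required_score
        print("threshold", threshold,
              "learned as ham:", len(learned_as_ham(MESSAGES, threshold)),
              "of which spam:", mislearned_spam(MESSAGES, threshold))
```

With this invented data, the 0.1 threshold learns fewer messages but none of them wrongly, while the 5.0 threshold sweeps every gray-area false negative into the ham side until someone re-trains it.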

Btw, is that per-user, or are you gambling a site-wide Bayes DB?

> I opted to give the users the choice of being responsible for sorting, 
> and reaping the benefits of that if they do. And yes, I know that some 
> are not going to.
> 
> I'd be interested if you have a better solution in mind.

Do not auto-learn as ham every message that scores below required_score.

Introduce train-on-error for your users, with an extended manual
training option: specific ham and spam folders, where moving or copying
mail into them trains the Bayes classifier. This stays optional for the
user, unless they feel there's too much mis-classification.
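
A minimal Python sketch of that folder-based scheme follows. The folder names TrainSpam and TrainHam, the message path and the helper sa_learn_command are assumptions for illustration. The only real interface used is sa-learn with its --spam and --ham options, and the sketch builds the command without running it. In a per-user setup it would presumably be run as the mailbox owner.

```python
import shlex

# Hypothetical folder names; adjust to the local IMAP layout.
FOLDER_ACTIONS = {
    "TrainSpam": "--spam",
    "TrainHam": "--ham",
}


def sa_learn_command(folder, message_path):
    """Build (but do not run) the sa-learn command for a message moved to `folder`.

    Returns None for folders that must not trigger training, such as Inbox
    or Trash, so plain delivery or deletion never teaches Bayes anything.
    """
    flag = FOLDER_ACTIONS.get(folder)
    if flag is None:
        return None
    return ["sa-learn", flag, message_path]


if __name__ == "__main__":
    for folder in ("TrainSpam", "TrainHam", "Inbox", "Trash"):
        cmd = sa_learn_command(folder, "/tmp/example-message.eml")
        print(folder, "->", shlex.join(cmd) if cmd else "no training")
```

The design point is that only an explicit user action trains the classifier. Mail that merely arrives in the inbox, or is deleted in a hurry, is left unlearned.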


Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman




On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:


> That's pretty bad practice. Fundamentally, you are implementing a custom
> auto-learn flavor, overruling the SA configurable auto-learn behavior


BTW, that reminds me of a question I had been meaning to ask on the 
list. Autolearn. There's very little written about it, so far as I am 
aware. But what I have gleaned from old posts is that it is 
system-wide and in-memory. Now, I have Spamass-milter set to run SA 3.3 
as the recipient user, using the filedb backend. So in 3.3, is autolearn 
system-wide and in memory, or per user and on disk?


This makes a difference regarding what Karsten and I are discussing. I 
don't suppose I would object to being wrong. But I have a feeling that 
I'm right.


-Steve

Re: Bayes, Manual and Auto Learning Strategies

2014-07-01 Thread Steve Bergman




On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:


> That's pretty bad practice. Fundamentally, you are implementing a custom
> auto-learn flavor, overruling the SA configurable auto-learn behavior


SA's autolearn behavior doesn't make much sense. I have no confidence in it.

This method shields the user from the worst of the spam, while giving 
them full control of what gets relearned as spam.



> and ignoring all safety concepts implemented by SA.


What safety concepts? autolearn is a complete joke. Even the docs 
explain that it's only there as a last resort method of kinda sorta 
training the spam filter.




> So if a user in a hurry simply deletes some spam, it will remain ham, as
> far as Bayes is concerned.


Same as with Thunderbird, I think. And it's working very well for them. 
If they act irresponsibly, they'll get more spam. It takes no longer to 
highlight the spam and click "Junk" than it does to highlight the spam 
and click "Delete".


I've pretty much decided at this point that if the users don't do what I 
tell them to do, repeatedly, then what results is not my responsibility.


And it's not.

The alternative is to not mark incoming mail as ham, and allow the SA 
Bayesian filter to remain inactive forever.


I opted to give the users the choice of being responsible for sorting, 
and reaping the benefits of that if they do. And yes, I know that some 
are not going to.


I'd be interested if you have a better solution in mind.

-Steve

Bayes, Manual and Auto Learning Strategies (was: Re: getting tons of SPAM)

2014-07-01 Thread Karsten Bräckelmann

On Tue, 2014-07-01 at 18:43 -0500, Steve Bergman wrote:
> On 07/01/2014 06:09 PM, RW wrote:
> > I'm sceptical about the use of Dovecot-Antispam with Spamassassin.
> > The problem is that it trains on SpamAssassin errors rather than Bayes
> > errors. It may be possible to get sufficient spam this way, but ham
> > is learned very slowly through avoidable FPs.
> 
> We currently (early days for this installation) get plenty of spam for 
> the users to train by moving it to the junk folder. Ham was the problem. 
> Dovecot does nothing about training ham.

Dovecot (and its antispam plugin) does nothing about training ham,
either. It offers target folders and triggers, for easy manual (re-)
classification -- and thus training -- of ham and spam.

> That's why I have a line in the users' default .forward file to train
> incoming mail as ham.

That's pretty bad practice. Fundamentally, you are implementing a custom
auto-learn flavor, overruling the SA configurable auto-learn behavior
and ignoring all safety concepts implemented by SA. There's a reason for
the ham and spam learning thresholds, and the ham threshold to be 0.1 by
default, *not* equaling required_score's default of 5.0.

> Then if they or Thunderbird decide to move the mail to Junk, it gets
> re-trained as spam.

So if a user in a hurry simply deletes some spam, it will remain ham, as
far as Bayes is concerned.

> dovecot-antispam is *not* a complete solution, so far as I can see.
> 
> At this early stage, it *is* painful to watch all that spam coming in 
> over the weekend getting trained as ham. I tell my users to mark it as 
> spam on Monday morning. And if they don't, I just figure it's not my fault.

It is your fault to implement a broken training strategy.

> Once the token databases get larger there won't be so much potential 
> flux back and forth, I guess.

