Re: BAYES question

2013-04-28 Thread Matus UHLAR - fantomas

Joe Acquisto-j4 skrev den 2013-04-27 13:37:

Very interesting.   However, I don't see any BAYES_xx markings in the
headers at all.


On 27.04.13 19:00, Joe Acquisto-j4 wrote:

I seem to have not stated my query clearly, as several have suggested this.
Or, it was perfectly understood, but I am not comprehending.

I don't want to know how to see the tokens, etc (I do, but already know how).
I was curious about this BAYES_xx thing, which I gather is something I should
see in a message header.


In one of your earlier e-mails you were complaining about spam hitting
BAYES_50.  What has changed since?  Did you clear the bad BAYES database?
Check it again: the folder and file permissions, and sa-learn --dump magic
to see whether it contains enough ham and spam.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Posli tento mail 100 svojim znamim - nech vidia aky si idiot
Send this email to 100 your friends - let them see what an idiot you are


Re: BAYES question

2013-04-27 Thread Jari Fredriksson
27.04.2013 04:54, Karsten Bräckelmann kirjoitti:
 And it is good advice to keep the initial training corpora to a
 ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this
 point, we're approaching voodoo. Learning 10 times more ham than spam is
 most likely to be a bad choice, though.)
I don't see any problem with having a corpus like this:

0.000  0  28252  0  non-token data: nspam
0.000  0 187579  0  non-token data: nham

I have no problems with Bayes whatsoever.

-- 

There's small choice in rotten apples.
-- William Shakespeare, The Taming of the Shrew






Re: BAYES question

2013-04-27 Thread Axb

On 04/27/2013 10:59 AM, Jari Fredriksson wrote:

27.04.2013 04:54, Karsten Bräckelmann kirjoitti:

And it is good advice to keep the initial training corpora to a
ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this
point, we're approaching voodoo. Learning 10 times more ham than spam is
most likely to be a bad choice, though.)

I don't see any problem with having a corpus like this:

0.000  0  28252  0  non-token data: nspam
0.000  0 187579  0  non-token data: nham

I have no problems with Bayes whatsoever.


how many users? domains?
Can hardly be a heavily spammed setup or it would look more like:

0.000  0 7762525  0  non-token data: nspam
0.000  0 4171794  0  non-token data: nham
(a week's worth of tokens)






Re: BAYES question

2013-04-27 Thread Jari Fredriksson
27.04.2013 12:03, Axb kirjoitti:
 On 04/27/2013 10:59 AM, Jari Fredriksson wrote:
 27.04.2013 04:54, Karsten Bräckelmann kirjoitti:
 And it is good advice to keep the initial training corpora to a
 ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this
 point, we're approaching voodoo. Learning 10 times more ham than
 spam is
 most likely to be a bad choice, though.)
 I don't see any problem with having a corpus like this:

 0.000  0  28252  0  non-token data: nspam
 0.000  0 187579  0  non-token data: nham

 I have no problems with Bayes whatsoever.

 how many users? domains?
 Can hardly be a heavily spammed setup or it would look more like:

 0.000  0 7762525  0  non-token data: nspam
 0.000  0 4171794  0  non-token data: nham
 (a week's worth of tokens)





Only me for spam & ham, and my colleagues for spam. While I try to
collect spam wherever I can, the amount of spam has dropped big
time over the last couple of years. My boss seems to draw most of the spam
among my sources ;)

The ham corpus also contains a lot of mailing-list mail (messages with a
List-Id header). That means they are included in my Bayes training, not in
my ruleqa. And I do skim through them, and move possible spam from them to
my spam corpus (not to ruleqa though).



-- 

For a light heart lives long.
-- Shakespeare, Love's Labour's Lost






Re: BAYES question

2013-04-27 Thread Joe Acquisto-j4
. . .
 Do train those, which have a Bayesian probability close(r) to 0.5. Or
 even worse, have a Bayesian probability contrary to the overall score,
 or actual classification.
 
 Training the plethora of spam hitting BAYES_99 might not be a mistake.
 But it is pretty likely, to *not* improve general SA performance.
 
 You're training Bayes. Not SpamAssassin.
 
 

Very interesting.   However, I don't see any BAYES_xx markings in the headers 
at all.

I assumed that is because it is not scoring yet, due to low samples.  Or some 
other reason.

How do I find that number?

joe a.



Re: BAYES question

2013-04-27 Thread Matus UHLAR - fantomas

Do train those, which have a Bayesian probability close(r) to 0.5. Or
even worse, have a Bayesian probability contrary to the overall score,
or actual classification.

Training the plethora of spam hitting BAYES_99 might not be a mistake.
But it is pretty likely, to *not* improve general SA performance.

You're training Bayes. Not SpamAssassin.


On 27.04.13 07:37, Joe Acquisto-j4 wrote:

Very interesting.   However, I don't see any BAYES_xx markings in the
headers at all.

I assumed that is because it is not scoring yet, due to low samples.  Or
some other reason.

How do I find that number?


sa-learn --dump magic, of course.
You need at least 200 hams and 200 spams for BAYES to start firing.
At the beginning, you can train on ANY mail. Later, it's easier just to correct
mail with a misfired score (ham not hitting BAYES_00 and spam not hitting BAYES_99).
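
The 200/200 check described above can be sketched in Python. This is only a minimal sketch: it assumes the four-column `sa-learn --dump magic` layout quoted elsewhere in this thread, and the function names `parse_magic` and `bayes_ready` are hypothetical helpers, not part of SpamAssassin.

```python
def parse_magic(dump_text):
    """Collect the 'non-token data' counters from `sa-learn --dump magic` output.

    Assumes each line looks like: '0.000  0  28252  0  non-token data: nspam',
    with the value of interest in the third column.
    """
    counts = {}
    for line in dump_text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        parts = fields.split()
        if len(parts) < 3:
            continue  # not the layout we expect; skip rather than guess
        counts[label.strip()] = int(parts[2])
    return counts


def bayes_ready(counts, min_ham=200, min_spam=200):
    """True once both counters reach the minimums quoted in this thread."""
    return counts.get("nham", 0) >= min_ham and counts.get("nspam", 0) >= min_spam


sample = """\
0.000  0  28252  0  non-token data: nspam
0.000  0 187579  0  non-token data: nham
"""

if __name__ == "__main__":
    counts = parse_magic(sample)
    print(counts, "ready" if bayes_ready(counts) else "not enough training yet")
```

In practice the text would come from running the command as the same user that scans the mail, for the reasons given elsewhere in this thread.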


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
BSE = Mad Cow Desease ... BSA = Mad Software Producents Desease


Re: BAYES question

2013-04-27 Thread Benny Pedersen

Joe Acquisto-j4 skrev den 2013-04-27 01:38:


path-to-ham  as one might feed missed spam, sa-learn --spam
path-to-spam


yes, but if you sort based on scores there is no point in using bayes 
in the first place


only thing that is important is to feed what is spam and what is ham to 
learning


--
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it


Re: BAYES question

2013-04-27 Thread Benny Pedersen

Joe Acquisto-j4 skrev den 2013-04-27 13:37:


Very interesting.   However, I don't see any BAYES_xx markings in the
headers at all.


how is your bayes set up?

what does sa-learn --dump magic give?


I assumed that is because it is not scoring yet, due to low samples.
Or some other reason.


that could be the reason; another might be different users doing the bayes learning


How do I find that number?


--dump magic

--
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it


Re: BAYES question

2013-04-27 Thread Benny Pedersen

Jari Fredriksson skrev den 2013-04-27 10:59:


0.000  0  28252  0  non-token data: nspam
0.000  0 187579  0  non-token data: nham

I have no problems with Bayes whatsoever.


this is a good working mta setup, not a bayes problem :)

--
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it


Re: BAYES question

2013-04-27 Thread Niamh Holding

Hello John,

Saturday, April 27, 2013, 12:50:34 AM, you wrote:

JH Simple rule: train any ham that doesn't hit BAYES_00.

???

What about ham that hits BAYES_00 and shows autolearn=no ?

-- 
Best regards,
 Niamh                          mailto:ni...@fullbore.co.uk



Re: BAYES question

2013-04-27 Thread Benny Pedersen

Niamh Holding skrev den 2013-04-27 18:25:


What about ham that hits BAYES_00 and shows autolearn=no ?


if it's spam, sa-learn --spam msg

else the above is ok; there is no need to learn it if it is already learned as
ham


--
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it


Re: BAYES question

2013-04-27 Thread Jari Fredriksson
27.04.2013 18:24, Benny Pedersen kirjoitti:
 Jari Fredriksson skrev den 2013-04-27 10:59:

 0.000  0  28252  0  non-token data: nspam
 0.000  0 187579  0  non-token data: nham

 I have no problems with Bayes whatsoever.

 this is an good working mta setup, not a bayes problem :)

My MTA does not reject anything. And I collect all spam from Gmail &
other sources just to get spam. I love spam.

-- 

Your reasoning powers are good, and you are a fairly good planner.






Re: BAYES question

2013-04-27 Thread John Hardin

On Fri, 26 Apr 2013, Joe Acquisto-j4 wrote:


So, I could just feed a bunch of good mail, to --ham, and spam that is 
correctly marked
as spam as well as missed spam, to --spam?


Correct; the important part is that what you train with must be *correctly 
classified* - training a ham as spam is not helpful... :)


Hang onto that as part of your base corpus in case you need to retrain.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Your mouse has moved. Your Windows Operating System must be
  relicensed due to this hardware change. Please contact Microsoft
  to obtain a new activation key. If this hardware change results in
  added functionality you may be subject to additional license fees.
  Your system will now shut down. Thank you for choosing Microsoft.
---
 331 days since the first successful private support mission to ISS (SpaceX)


Re: BAYES question

2013-04-27 Thread John Hardin

On Sat, 27 Apr 2013, Niamh Holding wrote:


Hello John,

Saturday, April 27, 2013, 12:50:34 AM, you wrote:

JH Simple rule: train any ham that doesn't hit BAYES_00.

???

What about ham that hits BAYES_00 and shows autolearn=no ?


If a ham hits BAYES_00 that means the Bayes system did a good job of 
recognizing it. Training on it won't hurt, but it won't help much either.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  ...in the 2nd amendment the right to arms clause means you have
  the right to choose how many arms you want, and the militia clause
  means that Congress can punish you if the answer is none.
-- David Hardy, 2nd Amendment scholar
---
 331 days since the first successful private support mission to ISS (SpaceX)


Re: BAYES question

2013-04-27 Thread Karsten Bräckelmann
On Sat, 2013-04-27 at 11:59 +0300, Jari Fredriksson wrote:
 27.04.2013 04:54, Karsten Bräckelmann kirjoitti:
  And it is good advice to keep the initial training corpora to a
  ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this
  point, we're approaching voodoo. Learning 10 times more ham than spam is
  most likely to be a bad choice, though.)
 
 I don't see any problem with having a corpus like this:

I don't see a problem there, either. And if you re-read the complete
paragraph, carefully avoiding overvaluing the voodoo marked comment,
you might realize I even suggested it. In your case.

You mentioned, you do not get much spam anyway. Moreover, you also
include mailing-lists in your ham corpus, whereas the average user
likely doesn't even participate on a single list.

Point being, am I correct in assuming these numbers roughly reflect your
ham/spam ratio?

 0.000  0  28252  0  non-token data: nspam
 0.000  0 187579  0  non-token data: nham
 
 I have no problems with Bayes whatsoever.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: BAYES question

2013-04-27 Thread Jari Fredriksson
27.04.2013 23:15, Karsten Bräckelmann kirjoitti:
 Point being, am I correct in assuming these numbers roughly reflect your
 ham/spam ratio?

  0.000  0  28252  0  non-token data: nspam
  0.000  0 187579  0  non-token data: nham
Yes. I want more spam, but it nowadays tries to evade me, dunno why.

-- 

Gratitude and treachery are merely the two extremities of the same procession.
You have seen all of it that is worth staying for when the band and the gaudy
officials have gone by.
-- Mark Twain, Pudd'nhead Wilson's Calendar






Re: BAYES question

2013-04-27 Thread Alex
Hi,

 To feed ham to bayes, should one only use mis-flagged mail, or may one
 use unflagged (below 5) mail?

 Expressed differently, can one feed good messages, sa-learn --ham
 path-to-ham  as one might feed missed spam, sa-learn --spam path-to-spam


 You can train hams that have scored high (i.e. misclassified hams) and you
 can proactively train low-scoring mail to try to avoid problems in the
 first place.


If there are some spam messages with BAYES_00, and the database needs to be
corrected, is it best to just learn it as spam, or use --forget, then
--spam?

I just grepped the quarantine and there were a handful of BAYES_00 with
overall scores between 6 and 10.

Thanks,
Alex


Re: BAYES question

2013-04-27 Thread Joe Acquisto-j4
 On 4/27/2013 at 1:20 PM, John Hardin jhar...@impsec.org wrote:
 On Fri, 26 Apr 2013, Joe Acquisto-j4 wrote:
 
 So, I could just feed a bunch of good mail, to --ham, and spam that is 
 correctly marked
 as spam as well as missed spam, to --spam?
 
 Correct; the important part is that what you train with must be *correctly 
 classified* - training a ham as spam is not helpful... :)
 
 Hang onto that as part of your base corpus in case you need to retrain.
 
 -- 
   John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ 
   jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org 
   key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
 ---
Your mouse has moved. Your Windows Operating System must be
relicensed due to this hardware change. Please contact Microsoft
to obtain a new activation key. If this hardware change results in
added functionality you may be subject to additional license fees.
Your system will now shut down. Thank you for choosing Microsoft.
 ---
   331 days since the first successful private support mission to ISS 
 (SpaceX)

Thanks. I have created YASF (yet another shared folder) to assist in this 
adventure.

joe a.



Re: BAYES question

2013-04-27 Thread Joe Acquisto-j4
 On 4/27/2013 at 11:17 AM, Benny Pedersen m...@junc.eu wrote:
 Joe Acquisto-j4 skrev den 2013-04-27 13:37:
 
 Very interesting.   However, I don't see any BAYES_xx markings in the
 headers at all.
 
 how is you bayes setup ?
 
 what gives sa-learn --dump magic ?
 
 I assumed that is because it is not scoring yet, due to low samples.
 Or some other reason.
 
 that could be the reason, others might be diff users bayes learning
 
 How do I find that number?
 
 --dump magic
 

I seem to have not stated my query clearly, as several have suggested this.
Or, it was perfectly understood, but I am not comprehending. 

I don't want to know how to see the tokens, etc (I do, but already know how).
I was curious about this BAYES_xx thing, which I gather is something I should
see in a message header.

joe a.



Re: BAYES question

2013-04-27 Thread John Hardin

On Sat, 27 Apr 2013, Alex wrote:


Hi,

To feed ham to bayes, should one only use mis-flagged mail, or may one
use unflagged (below 5) mail?

Expressed differently, can one feed good messages, sa-learn --ham
path-to-ham  as one might feed missed spam, sa-learn --spam path-to-spam



You can train hams that have scored high (i.e. misclassified hams) and you
can proactively train low-scoring mail to try to avoid problems in the
first place.


If there are some spam messages with BAYES_00, and the database needs to be
corrected, is it best to just learn it as spam, or use --forget, then
--spam?

I just grepped the quarantine and there were a handful of BAYES_00 with
overall scores between 6 and 10.


Just re-learn it as spam, that automatically forgets that it was ham.

--forget is only useful to completely remove that message from the 
database.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Any time law enforcement becomes a revenue center, the system
  becomes corrupt.
---
 331 days since the first successful private support mission to ISS (SpaceX)


Re: BAYES question

2013-04-27 Thread John Hardin

On Sat, 27 Apr 2013, Joe Acquisto-j4 wrote:


I don't want to know how to see the tokens, etc (I do, but already know how).
I was curious about this BAYES_xx thing, which I gather is something I should
see in a message header.


Yes, the BAYES_## are rules that would show up in the hit-rules list 
in the processed message's headers - assuming bayes is working.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Any time law enforcement becomes a revenue center, the system
  becomes corrupt.
---
 331 days since the first successful private support mission to ISS (SpaceX)


Re: BAYES question

2013-04-27 Thread Karsten Bräckelmann
On Sat, 2013-04-27 at 19:00 -0400, Joe Acquisto-j4 wrote:
   Very interesting.   However, I don't see any BAYES_xx markings in the
   headers at all.

   I assumed that is because it is not scoring yet, due to low samples.
   Or some other reason.
  
  that could be the reason, others might be diff users bayes learning
  
   How do I find that number?
  
  --dump magic
 
 I seem to have not stated my query clearly, as several have suggested this.
 Or, it was perfectly understood, but I am not comprehending. 
 
 I don't want to know how to see the tokens, etc (I do, but already know how).

You assumed Bayes might not be working due to low samples, and asked how
to find that number. Are you not asking for the number of ham and spam
learned?

  sa-learn --dump magic

See the result of this command for the number of spam and ham learned
(nspam and nham respectively).

You must run that command as the user SA runs as when scanning incoming
mail. Which might be the recipient's system user, or a site-wide user
depending on your setup.

Obviously, initially training Bayes needs to be done as that very
user(s), too.

Which user(s) are that? Do you use site-wide or per-user configuration?


 I was curious about this BAYES_xx thing, which I gather is something I should
 see in a message header.

The BAYES_nn headers are rules reflecting the Bayesian probability of
the mail on a scale between ham (0.00) and spam (1.00). The two digit
number is the probability expressed in percent.
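
To make the mapping concrete, here is a small Python sketch that turns a Bayesian probability into a BAYES_nn rule name. The bucket boundaries are an assumption taken from the stock 3.3.x rule set as I remember it (the only one confirmed in this archive is BAYES_40 covering 20 to 40%), and `bayes_rule` is a hypothetical helper, so check your own rules files before relying on it.

```python
import bisect

# Exclusive upper bounds of each bucket and the matching rule names.
# Assumed from the stock rule set; a local configuration may differ.
_UPPER_BOUNDS = [0.01, 0.05, 0.20, 0.40, 0.60, 0.80, 0.95, 0.99]
_RULE_NAMES = ["BAYES_00", "BAYES_05", "BAYES_20", "BAYES_40", "BAYES_50",
               "BAYES_60", "BAYES_80", "BAYES_95", "BAYES_99"]


def bayes_rule(probability):
    """Return the BAYES_nn rule a probability between 0.0 and 1.0 would hit."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    # bisect_right treats each lower bound as inclusive (an assumption).
    return _RULE_NAMES[bisect.bisect_right(_UPPER_BOUNDS, probability)]


if __name__ == "__main__":
    for p in (0.0, 0.2729, 0.5, 0.999):
        print(p, bayes_rule(p))
```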

As has been pointed out at least twice in this thread, Bayes will not
start working until at least 200 ham and 200 spam have been trained.

How many did you train yet? (Hint: Output of above command.)

Also, of course, Bayes needs to be enabled. It is by default, though you
might want to cross-check with your site and/or user configuration. See
the section Learning Options in the docs.

  http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: BAYES question

2013-04-26 Thread John Hardin

On Fri, 26 Apr 2013, Joe Acquisto-j4 wrote:

To feed ham to bayes, should one only use mis-flagged mail, or may 
one use unflagged (below 5) mail?


Expressed differently, can one feed good messages, sa-learn --ham 
path-to-ham  as one might feed missed spam, sa-learn --spam 
path-to-spam


You can train hams that have scored high (i.e. misclassified hams) and you 
can proactively train low-scoring mail to try to avoid problems in the 
first place.


Simple rule: train any ham that doesn't hit BAYES_00.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The Tea Party wants to remove the Crony from Crony Capitalism.
  OWS wants to remove Capitalism from Crony Capitalism.
-- Astaghfirullah
---
 330 days since the first successful private support mission to ISS (SpaceX)


Re: BAYES question

2013-04-26 Thread Joe Acquisto-j4
 On 4/26/2013 at 7:50 PM, John Hardin jhar...@impsec.org wrote:
 On Fri, 26 Apr 2013, Joe Acquisto-j4 wrote:
 
 To feed ham to bayes, should one only use mis-flagged mail, or may 
 one use unflagged (below 5) mail?

 Expressed differently, can one feed good messages, sa-learn --ham 
 path-to-ham  as one might feed missed spam, sa-learn --spam 
 path-to-spam
 
 You can train hams that have scored high (i.e. misclassified hams) and you 
 can proactively train low-scoring mail to try to avoid problems in the 
 first place.
 
 Simple rule: train any ham that doesn't hit BAYES_00.
 
 -- 
   John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ 

Well, right now, there are no bayes hits at all.   I cleared bayes to re-train, 
after
correcting for a botched initial scheme.

While I am getting a fair amount of missed spam, there is very little 
mis-classified.

So I am looking for a way to speed up learning.

So, I could just feed a bunch of good mail, to --ham, and spam that is 
correctly marked
as spam as well as missed spam, to --spam?

or do I need a rest?

joe a.




Re: BAYES question

2013-04-26 Thread Karsten Bräckelmann
On Fri, 2013-04-26 at 21:25 -0400, Joe Acquisto-j4 wrote:
 Well, right now, there are no bayes hits at all.   I cleared bayes to
 re-train, after correcting for a botched initial scheme.
 
 While I am getting a fair amount of missed spam, there is very little
 mis-classified.
 
 So I am looking for a way to speed up learning.

Initial training. Train on existing, verified corpora.

 So, I could just feed a bunch of good mail, to --ham, and spam that is
 correctly marked as spam as well as missed spam, to --spam?

Yes. Bayes by default will not be used for scoring (it does learn,
though), unless at least 200 spam and ham each have been learned.

So by training, you can have Bayes kick in earlier.

Ham usually does not change much over time. Spam does, significantly.
Training 1000 ham received the last months, years, whatever, thus
generally is OK. You'd want to limit the time span for training spam,
though. And it is good advice to keep the initial training corpora to a
ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this
point, we're approaching voodoo. Learning 10 times more ham than spam is
most likely to be a bad choice, though.)
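
One way to picture that advice is a small Python sketch that picks an initial training set: spam limited to a recent time window, ham capped relative to the spam that survives the cut. The function name, the 90-day window, and the 1:1 cap are all illustrative assumptions, not SpamAssassin settings; preferring the newest ham is an arbitrary choice here, since older ham is generally fine.

```python
from datetime import datetime, timedelta


def pick_training_set(ham, spam, now, max_spam_age_days=90, ham_per_spam=1.0):
    """Choose messages for initial Bayes training.

    ham, spam: lists of (path, received_datetime) tuples, already verified.
    Returns (ham_paths, spam_paths).
    """
    cutoff = now - timedelta(days=max_spam_age_days)
    # Spam changes quickly, so keep only what arrived inside the window.
    recent_spam = [path for path, received in spam if received >= cutoff]
    # Cap ham relative to spam to keep the ratio from drifting too far.
    ham_limit = int(len(recent_spam) * ham_per_spam)
    newest_ham = sorted(ham, key=lambda item: item[1], reverse=True)[:ham_limit]
    return [path for path, _ in newest_ham], recent_spam


if __name__ == "__main__":
    now = datetime(2013, 4, 26)
    ham = [("h1", datetime(2012, 1, 1)), ("h2", datetime(2013, 4, 1)),
           ("h3", datetime(2013, 3, 1))]
    spam = [("s1", datetime(2013, 4, 20)), ("s2", datetime(2011, 1, 1)),
            ("s3", datetime(2013, 2, 15))]
    print(pick_training_set(ham, spam, now))
```

The resulting path lists would then be fed to sa-learn --ham and sa-learn --spam respectively. Set ham_per_spam to match your real ham/spam ratio if that is the target rather than 1/1.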


 or do I need a rest?

Dunno. Got a beer near you?


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: BAYES question

2013-04-26 Thread Karsten Bräckelmann
On Fri, 2013-04-26 at 19:38 -0400, Joe Acquisto-j4 wrote:
 To feed ham to bayes, should one only use mis-flagged mail, or may
 one use unflagged (below 5) mail?

The Bayesian classifier is a subsystem mostly independent from SA.

Most SA rules are rather white or black. Match, or don't. And scored
according to the probability of actually distinguishing ham from spam.
The higher the absolute score of a given rule, the higher the
probability to be ham (negative score) or spam (positive score). Mere
hints, but not reliable indicators, have low scores.

For a scoring system like SA, this is generically true. With different,
varying scales.

It is correct for single rules. Dunno would be a rule's score of zero.
The higher the score, the more spammy it is.

It is correct for the overall, resulting score of a message. The dunno
tipping point is 5 by default. A message scoring 4.5 is more likely ham,
though you'd better not bet on it.

And it also is correct for the Bayes subsystem, with a notable scale of
its own -- ranging from 0 (ham) to 1 (spam), with 0.5 being a big fat
shrug. The BAYES_nn rules and their scores are set accordingly. BAYES_50
really should have no score.


Back to the question, and explaining why I mentioned the above.

"Mis-flagged" mail, false positives and false negatives, do exist on
multiple levels. The OP mentioned it with respect to the *overall*
score. And asked about *Bayes* training.

Training Bayes, first and foremost, helps Bayes only. In the end, it
might make a significant difference overall, sure. However, when it
comes to the question whether training Bayes might help...

Look at the Bayesian probability. Not the overall SA score.

Do train those, which have a Bayesian probability close(r) to 0.5. Or
even worse, have a Bayesian probability contrary to the overall score,
or actual classification.

Training the plethora of spam hitting BAYES_99 might not be a mistake.
But it is pretty likely, to *not* improve general SA performance.

You're training Bayes. Not SpamAssassin.
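
The selection rule above can be written down as a tiny Python predicate. It is only a sketch: `worth_training` and the 0.2 margin around 0.5 are hypothetical choices for illustration, and the message's true class must come from a human, not from the overall SA score.

```python
def worth_training(bayes_probability, is_spam, margin=0.2):
    """Decide whether training a message is likely to teach Bayes anything.

    bayes_probability: the Bayes result for the message, 0.0 (ham) to 1.0 (spam).
    is_spam: the verified, actual classification of the message.
    """
    # Bayes shrugged: the probability sits close to 0.5.
    unsure = abs(bayes_probability - 0.5) < margin
    # Bayes was confident, but on the wrong side of the actual classification.
    wrong = (bayes_probability >= 0.5) != is_spam
    return unsure or wrong


if __name__ == "__main__":
    print(worth_training(0.999, is_spam=True))   # spam already hitting BAYES_99
    print(worth_training(0.45, is_spam=True))    # spam Bayes was unsure about
    print(worth_training(0.02, is_spam=True))    # spam Bayes took for ham
```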


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Bayes Question

2007-04-23 Thread Matt Kettler
Craig wrote:
 Hello All-
  
 My bayes database seems to have problems and I would like suggestion
 on how to correct.  Here is my issue-
 I take any spam email from my users and run the following commands
 a. spamassassin -R name of spam file to check
 b. spamassassin -r name of spam file to check
 c. sa-learn --forget name of spam file to check
 d. sa-learn --spam name of spam file to check
1) running forget before training is redundant. SpamAssassin is smart
enough to realize when it is retraining a message that was previously
learned the wrong way and compensate correctly.

2) Running sa-learn --spam after spamassassin -r is redundant, unless
you've set bayes_learn_during_report to 0.

So really, you only need to do a and b. I assume you're trying to
overdo it on purpose, but did want to point out what parts are
redundant for clarity's sake.

  
 I re-run an email (spamassassin -D -t name of spam file to check
 name of spam file to check.txt)
  to check all is well-that bayes learned the email as spam.  Today
 after running the above I still have several messages with the
 following output info:
  
  
 Content analysis details:   (-0.1 points, 5.0 required)
  
  pts rule name              description
 ---- ---------------------- --------------------------------------------------
  0.1 FORGED_RCVD_HELO       Received: contains a forged HELO
 -0.2 BAYES_40               BODY: Bayesian spam probability is 20 to 40%
                             [score: 0.2729]
 Thoughts?
My first thought is:

What was its bayes score *before* you trained?

Training a single message as spam will not guarantee that it will
immediately get a high bayes score. If most of the tokens in the message
strongly match a large volume of nonspam training, it will take a
similar volume of spam training to overcome it. Otherwise one
mis-trained message could wildly upset your whole bayes database causing
large numbers of mis-marked messages.
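
A toy calculation shows why one training run barely moves a token. The Python below uses a deliberately simplified per-token estimate (relative spam frequency against relative ham frequency); it is not SpamAssassin's actual formula, and the counts are made up for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, nspam, nham):
    """Simplified estimate of how spammy a single token looks.

    spam_hits / ham_hits: number of learned spam / ham messages containing it.
    nspam / nham: total number of learned spam / ham messages.
    """
    spam_rate = spam_hits / nspam if nspam else 0.0
    ham_rate = ham_hits / nham if nham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen: no opinion
    return spam_rate / (spam_rate + ham_rate)


# A token seen in 300 of 1000 ham messages but only 2 of 1000 spam messages.
before = token_spam_probability(spam_hits=2, ham_hits=300, nspam=1000, nham=1000)
# The same token after learning one more spam message that contains it.
after = token_spam_probability(spam_hits=3, ham_hits=300, nspam=1001, nham=1000)

if __name__ == "__main__":
    print(f"before: {before:.4f}  after: {after:.4f}")
```

Under these assumed counts the token stays firmly on the ham side after the extra spam message, which matches the point that it takes a similar volume of spam training to overcome a large volume of nonspam training.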

If you really are seeing this problem a lot, you might want to take some
of the spams and run them through spamassassin -D bayes in order to
get the individual tokens and their scores to be printed out.
(Previously you used spamassassin -D -t, which isn't the same. That's
general debugging, but doesn't enable detailed bayes debugging.)











Re: Bayes question

2006-02-21 Thread Steven Stern

M. Lewis wrote:

I recently lost a hard drive and have had to setup everything again.

I'm seeing a fair amount of spam that is getting through my filters.
From what I can see in the headers of messages, bayes does not seem to
be used at all. I'm reasonably sure this is the reason I'm seeing spam.

If I do # spamassassin -t -D < spam.txt   I can clearly see bayes is
being used.


Suggestions for what to check?

Thanks for any ideas.
M



sa-learn --dump magic

What does it say?

--

  Steve


Re: Bayes question

2006-02-21 Thread Steven Stern

M. Lewis wrote:

Thanks Steve,

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0      57468          0  non-token data: nspam
0.000          0      16419          0  non-token data: nham
0.000          0     181931          0  non-token data: ntokens
0.000          0 1139892654          0  non-token data: oldest atime
0.000          0 1140583854          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1140584727          0  non-token data: last expiry atime
0.000          0     691200          0  non-token data: last expire atime delta
0.000          0       1510          0  non-token data: last expire reduction count





Please keep replies on the list

I was wondering if you'd had enough ham and spam to get past the 
minimums.  Looks like you have.
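
For reading the other fields of that dump, a short Python sketch may help. It assumes the atime values are Unix timestamps in seconds and the expire delta is a duration in seconds; the dictionary simply copies the numbers posted above.

```python
from datetime import datetime, timezone

# Values copied from the `sa-learn --dump magic` output quoted above.
magic = {
    "oldest atime": 1139892654,
    "newest atime": 1140583854,
    "last expire atime delta": 691200,
}

# Time span covered by the tokens currently in the database.
span_seconds = magic["newest atime"] - magic["oldest atime"]
span_days = span_seconds / 86400

# Assumed to be seconds since the Unix epoch; shown in UTC.
newest = datetime.fromtimestamp(magic["newest atime"], tz=timezone.utc)

if __name__ == "__main__":
    print(f"tokens span {span_days:g} days, newest access {newest:%Y-%m-%d %H:%M} UTC")
```

With these numbers the span between oldest and newest atime equals the last expire atime delta, that is, eight days.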


How about posting the output from

   spamassassin -D --lint


--

  Steve


Re: Bayes question

2006-02-21 Thread M. Lewis

Sorry, I am in the habit of 'reply' as opposed to 'reply all'.

I see no 'obvious' errors in spamassassin -D --lint which was the first 
thing I checked.


Shortly before you asked about the 'sa-learn --dump magic', I found this 
message from Matt:


http://marc.theaimsgroup.com/?l=spamassassin-users&m=113327783327806&w=2

I did this and now I'm seeing bayes markups. So hopefully it was just a 
perms issue that is now resolved.


Thanks,
Mike


Steven Stern wrote:

M. Lewis wrote:


Thanks Steve,

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0      57468          0  non-token data: nspam
0.000          0      16419          0  non-token data: nham
0.000          0     181931          0  non-token data: ntokens
0.000          0 1139892654          0  non-token data: oldest atime
0.000          0 1140583854          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1140584727          0  non-token data: last expiry atime
0.000          0     691200          0  non-token data: last expire atime delta
0.000          0       1510          0  non-token data: last expire reduction count





Please keep replies on the list

I was wondering if you'd had enough ham and spam to get past the 
minimums.  Looks like you have.


How about posting the output from

   spamassassin -D --lint




--

 Those who can, do.  Those who cannot, teach.  Those who cannot teach, 
HACK!

  00:30:01 up 3 days, 35 min,  6 users,  load average: 0.54, 0.60, 0.58

 Linux Registered User #241685  http://counter.li.org


Re: bayes question (sa-learn)

2006-02-15 Thread Patrick von der Hagen

Philipp Snizek wrote:
[...]

However, I fear SA learns that headers coming from my internal MTA could be
spam and so causing false results on real spam. 

Exactly. Forwarding e-mail breaks the original information and has to be 
avoided.




What experiences have you made or how have you solved this ?
(e.g. by setting up an IMAPd on the spamgateway?)

You can configure an imapd wherever you want; there are many tools out 
there to fetch IMAP mailboxes to a local maildir/mbox/anything, which 
can then be used by sa-learn.
I use cyrus and have two folders, spamreport and hamreport, which are 
shared among my users, who have write but no read access.
Even if I ignored those folders, my users would just be happy to give 
feedback and contribute. ;-)

--
CU,
   Patrick.


Re: Bayes question

2005-07-27 Thread JamesDR

Robert Swan wrote:
I have a pair of SpamAssassin servers filtering e-mail (SpamAssassin 
3.0.4, spamd/spamc, Postfix, Red Hat 9). I was wondering if I could share 
the bayes database between the two servers rather than having each with 
its own and having to do the sa-learn process twice.

Any thoughts?

Robert

Peace he would say instead of goodbye peace my brother.

Yes... Use the bayes (MY|Postgre)SQL modules, see the docs on how to set 
this up.


--
Thanks,
James



RE: Bayes question

2005-07-27 Thread Alan Fullmer








I attempted to do that once, with a network file system, but it didn't seem
to know how to handle the locking properly.  I know I did something wrong,
so if anyone else has a solution, I'd also be happy to hear it! :-)

-Alan Fullmer

[EMAIL PROTECTED]

www.xnote.com

www.zoobuh.com

From: Robert Swan [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 27, 2005 12:22 PM
To: users@spamassassin.apache.org
Subject: Bayes question

 I have a pair of Spamassassin servers filtering e-mail
 (Spamassassin 3.0.4, spamd/spamc, Postfix, redhat 9) I was wondering if I could
 share the bayes database between the two server rather than having each with
 its own and having to do the salearn process twice.

 Any Thoughts?

 Robert

 Peace he would say instead of goodbyepeace my brother.


RE: Bayes question

2005-07-27 Thread Tyler Nally
Boy... anytime I've done some kind of network file sharing across
a system or two, I have never done it for performance reasons...
only for convenience's sake.  And even then, never with large files.

Almost a decade ago when I was performing massive COBOL database
conversions to load data into flat files to be imported into a
relational database, I noticed a significant decrease in performance
on the machine that was accessing remotely stored files.  It was far
easier/faster to auto-ftp the half a gigabyte of information to another
machine so that it could have the information *local* and therefore
access the data extremely quickly.   Depending on the machine and
its resources, I'd expect remote access to slow down its processing by
25-40% on average.

If the data remained on a remote machine, then the CPU had to use
its resources to handle the resources on the remote file system
as if they were part of its own.  It is then at the whim of an NFS
file handle that may or may not stay fresh.  Even if the
machines are separated by a couple feet of cable .. for me .. back
then ... NFS wasn't reliable enough for me to be able to bank on it
being up.  Because when the remote NFS file handle went stale, it
caused the local machine to hang and drag.  Maybe NFS is better now
than back then... I don't know.

The machine doesn't make a network *call* to the other machine to
borrow its resources; it uses its own resources to access the
remote files as if they were local, yet it does so over a network
cable rather than over the typical high-speed motherboard bus that
would access the local hard drive.

So... the only way I'd do this in this day and age would be to have
the kind of hardware with which you could build a multi-node supercomputer
where all nodes share the same hard drive over a fiber optic network,
with lightning quick hard disks on the server node as it shares its
resources with the worker nodes.  In that case, the networking element
has been removed from the equation as the slowest link in the chain
of events.

On Wed, July 27, 2005 16:37, Alan Fullmer said:
 I attempted to do that once, with a network file system, but it didn't seem
 to know how to handle the locking properly.  I know I did something wrong,
 so if anyone else has a solution, I'd also be happy to hear it! :-)


-- 
Tyler Nally
[EMAIL PROTECTED]




Re: Bayes question

2005-07-27 Thread Matt Kettler
Alan Fullmer wrote:
 I attempted to do that once, with a network file system, but it didn’t
 seem to know how to handle the locking properly.  I know I did something
 wrong, so if anyone else has a solution, I’d also be happy to hear it! :-)

As JamesDR suggested.. Do it right, use SQL. It's a database that's *designed*
to be accessed remotely. Trying to share a DB_File based database over NFS is
asking for poor performance and trouble.


Re: Bayes question

2005-04-14 Thread Matt Kettler
Joe Zitnik wrote:

I apologize if this has been asked before, but I need some
clarification.  If I have autolearn for ham set to 0, and the default
BAYES_00 score assigns mail a negative value, and a spam message comes
through with enough good text in it to give it a BAYES_00 and therefore
a negative value BUT it is not a message that has been learned before,
is there the potential for that mail to be learned as ham based on the
negative BAYES score assigned it?  
  

No. It's 100% impossible, as the bayes autolearner makes its judgments
based on the score the message would have gotten if bayes was disabled.
Preventing that kind of self-feedback is exactly why this is done.

(Note that calculating the score as if bayes was disabled also
involves calculating the score using scoreset 0 or 1 instead of 2 or 3.)

The autolearner also ignores any userconf flagged rules, such as white
and blacklists.
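
To make the point concrete, here is a rough Python model of that decision. It is only a sketch and not SpamAssassin's actual code: the rule names and scores in the example are hypothetical, the function names are mine, and the comparison logic is a simplification (the real autolearner also switches scoresets and has further requirements for learning spam that are not modelled here). The default thresholds of 0.1 and 12.0 are the stock bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam values as far as I know.

```python
def autolearn_score(hits):
    """Sum rule scores, skipping BAYES_* rules and userconf-flagged rules.

    hits is a list of (rule_name, score, is_userconf) tuples.
    """
    return sum(
        score
        for name, score, is_userconf in hits
        if not name.startswith("BAYES_") and not is_userconf
    )


def autolearn_decision(hits, ham_threshold=0.1, spam_threshold=12.0):
    """Return 'ham', 'spam' or 'no' for a simplified autolearn check."""
    score = autolearn_score(hits)
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return "no"


# Hypothetical message: a strong BAYES_00 hit plus one ordinary rule.
hits = [
    ("BAYES_00", -2.6, False),           # ignored by the autolearner
    ("HYPOTHETICAL_RULE", 1.5, False),   # made-up rule name and score
    ("USER_IN_WHITELIST", -100.0, True), # userconf rule, also ignored
]

if __name__ == "__main__":
    # Ham threshold of 0, as in the question above.
    print(autolearn_decision(hits, ham_threshold=0.0))
```

With the Bayes and whitelist scores included the total would be far below zero, but the autolearner only sees the 1.5 from the ordinary rule, so nothing is learned as ham.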




Re: bayes question

2005-01-10 Thread Michael Parker
On Mon, Jan 10, 2005 at 04:22:03PM -0500, Sunny Forro wrote:
 Help!
   I know this has got to be the number 1 question. But I haven't
 had any luck with it:
 

Actually, it doesn't happen that often these days.

 I'm getting:
 Bayes: bayes db version 2 is not able to be used, aborting!
 
 errors. I followed the instructions in UPGRADE, i.e. I shutdown all
 running processes and verified there were no locks. Ran SA-LEARN
 --rebuild. Installed SpamAssassin 3.0.2. Ran SA-LEARN --SYNC and got the
 bayes db version 2 warning read the FAQ found the post on it's just
 a warning that'll go away. Ran SA-LEARN --SYNC again, same problem.
 Waited 2 days (about 1500 messages) and still receive same problem. My
 bayes db files (bayes.seen bayes.toks) are having their file dates and
 sizes kept up to date, but I still receive this error.
 
 I'm running SA 3.0.2 with MailScanner 4.37.7 (although I get the error
 no matter how I run SA, including spamassassin -t), on FreeBSD 4.9p13
 with Sendmail 8.13.2.
 

http://wiki.apache.org/spamassassin/BayesUpgradeError

Not clear from the above if you read that, but at the end it talked
about sending the -D output.  Without that, there isn't much that can
be done.

Michael




RE: bayes question

2005-01-10 Thread Sunny Forro
 either.



Elmer Steve Forro III (Sunny)
Assistant Manager of Information Systems
Compco Industries
400 West Railroad Street
Columbiana, OH 44511
Phone:   (330) 482-0200 x229
Fax: (330) 482-6429
Cell:(330) 240-6611
Email:   [EMAIL PROTECTED]
Web: http://www.compcoind.com/
 

-Original Message-
From: Michael Parker [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 10, 2005 4:30 PM
To: Sunny Forro
Cc: users@spamassassin.apache.org
Subject: Re: bayes question

On Mon, Jan 10, 2005 at 04:22:03PM -0500, Sunny Forro wrote:
 Help!
   I know this has got to be the number 1 question. But I haven't
 had any luck with it:
 

Actually, it doesn't happen that often these days.

 I'm getting:
 Bayes: bayes db version 2 is not able to be used, aborting!
 
 errors. I followed the instructions in UPGRADE, i.e. I shutdown all
 running processes and verified there were no locks. Ran SA-LEARN
 --rebuild. Installed SpamAssassin 3.0.2. Ran SA-LEARN --SYNC and got the
 bayes db version 2 warning read the FAQ found the post on it's just
 a warning that'll go away. Ran SA-LEARN --SYNC again, same problem.
 Waited 2 days (about 1500 messages) and still receive same problem. My
 bayes db files (bayes.seen bayes.toks) are having their file dates and
 sizes kept up to date, but I still receive this error.
 
 I'm running SA 3.0.2 with MailScanner 4.37.7 (although I get the error
 no matter how I run SA, including spamassassin -t), on FreeBSD 4.9p13
 with Sendmail 8.13.2.
 

http://wiki.apache.org/spamassassin/BayesUpgradeError

Not clear from the above if you read that, but at the end it talked
about sending the -D output.  Without that, there isn't much that can
be done.

Michael



Re: bayes question

2005-01-10 Thread Michael Parker
On Mon, Jan 10, 2005 at 04:50:57PM -0500, Sunny Forro wrote:
 debug: bayes: found bayes db version 2
 bayes: bayes db version 2 is not able to be used, aborting! at
 /usr/local/lib/perl5/site_perl/5.8.4/Mail/SpamAssassin/BayesStore/DBM.pm
 line 160.

OK, yeah, this is just a warning, not an error.  Ignore the fact that it
says aborting; it is only aborting the check for whether scanning is
available.

 debug: bayes: found bayes db version 2
 debug: bayes: detected bayes db format 2, upgrading
 debug: bayes: upgrading database format from v2 to v3

Now we get to the meat of the matter.  Here is where we finally open
up a read/write connection and force the upgrade.  And it looks like
it finishes just fine.  If you run sa-learn -D --sync again does it
show the same upgrading message?

You're running this command as root, I assume.  Are you using a
bayes_path config option?  Are you using a sitewide bayes?  Are you
possibly just seeing this message multiple times as you run it as
different users?

Michael




Re: bayes question

2005-01-10 Thread Michael Parker
In the future, please be sure to CC the list as well, so it can get
dumped into the archives for future use.

On Mon, Jan 10, 2005 at 06:13:16PM -0500, Sunny Forro wrote:
 Michael,
   I am running it as root. I get the error every time I run
 SA-LEARN -D --SYNC, I don't get bayes checking with spamassassin. I
 haven't been running it with a bayes_path option, my old SpamAssassin
 used /root/.spamassassin as the db path. This is a sitewide setup, it's
 used to filter emails coming in for some charitable organizations hosted
 on this box. I effectively get the same exact output every time I run
 sa-learn -d --sync with the exception of the number of tokens it ties to
 the db file. It still says upgrading database from version 2 to version
 3 every time.

Very odd.  It is possible that there is some sort of db corruption
that is causing a strange failure.  Are there any extra files in
/root/.spamassassin?

Here are a few stabs in the dark that may or may not help.

Try setting bayes_path and bayes_file_mode and running the sync
again.  Read up on sitewide bayes on the wiki.

You could try to do a sa-learn --backup and then a sa-learn --restore
to see if that fixes the problem.

Did you move this db from another machine?  Maybe it is a Berkeley DB
library conflict?  Perhaps a db_dump and db_load (see wiki for info)
would help.  For that matter, you might try a sa-learn --import first
and see if that helps.

Worst case, blow away the database files and start from scratch.

Michael




Re: Bayes question

2004-12-21 Thread Jon Drukman
Chuck Campbell wrote:
On Mon, Dec 20, 2004 at 12:56:43PM -0600, Steve Bondy wrote:
  For example, the default score in 2.6.x for BAYES_90 is either 2.454 or
  2.101.  If that's the only rule you hit, and your threshold is above
  those numbers, it will come through.

 But what if you repeatedly learn the message(s) in question as spam?
 Shouldn't bayes start to give it higher scores?  If it becomes a near perfect
 match, it should get a bayes_99, right?

true, but by default BAYES_99 alone still won't mark a message as spam.
the default BAYES_99 score is either 4.07 or 1.886, and the default
threshold for spam is 5.0.
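
here's the arithmetic as a tiny python sketch, in case it helps. the BAYES_99 score and the 5.0 threshold are the numbers quoted above; the second rule and its score are made up purely for illustration, and is_spam is just my own helper name, not anything from SA.

```python
REQUIRED_SCORE = 5.0  # default spam threshold mentioned above


def is_spam(hits, required=REQUIRED_SCORE):
    """A message is tagged as spam once its summed rule scores reach the threshold."""
    return sum(hits.values()) >= required


bayes_only = {"BAYES_99": 4.07}
# "SOME_OTHER_RULE" and its 1.0 score are hypothetical.
bayes_plus_one = {"BAYES_99": 4.07, "SOME_OTHER_RULE": 1.0}

if __name__ == "__main__":
    print(is_spam(bayes_only))
    print(is_spam(bayes_plus_one))
```

so a BAYES_99 hit needs at least one more rule worth about a point before the message gets tagged, unless you raise the BAYES_99 score locally.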

also bayes won't learn the *exact* same message repeatedly.  if it's 
already seen a message it won't process it at all.  i'm not sure if it 
works off the message-id or a hash of the message content.

i set BAYES_99 to a very high score for my personal setup, because i 
have never seen a legit message yet that triggered that rule.

-jsd-


Re: Bayes question

2004-12-21 Thread Theo Van Dinter
On Mon, Dec 20, 2004 at 08:28:45PM -0800, Jon Drukman wrote:
 also bayes won't learn the *exact* same message repeatedly.  if it's 
 already seen a message it won't process it at all.  i'm not sure if it 
 works off the message-id or a hash of the message content.

Just for clarification, it's a SHA1 hash of several message headers and a
section of the body.  It's not (anymore) simply the Message-Id header. :)
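
A minimal Python sketch of the idea, for illustration only. Which headers go into the hash and how much of the body is used are assumptions I picked for the example (the HEADERS tuple and BODY_CHARS below are hypothetical), so this will not reproduce the ids SpamAssassin actually stores. It also assumes a simple single-part message.

```python
import hashlib
from email import message_from_string

# Assumptions for this sketch, not SpamAssassin's real choices.
HEADERS = ("Date", "Message-Id")
BODY_CHARS = 1024


def pseudo_msgid(raw_message):
    """Return a SHA1 hex digest over a few headers and the start of the body."""
    msg = message_from_string(raw_message)
    digest = hashlib.sha1()
    for name in HEADERS:
        digest.update((msg.get(name) or "").encode("utf-8"))
    body = msg.get_payload()
    if not isinstance(body, str):
        body = str(body)  # multipart messages are out of scope here
    digest.update(body[:BODY_CHARS].encode("utf-8"))
    return digest.hexdigest()


raw = (
    "Date: Mon, 20 Dec 2004 12:00:00 -0600\n"
    "Message-Id: <example@invalid>\n"
    "Subject: test\n"
    "\n"
    "hello world\n"
)

if __name__ == "__main__":
    print(pseudo_msgid(raw))
```

The same message always yields the same id, which is how a re-learn of an already seen message can be skipped, while two copies of a spam with different headers get different ids.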

-- 
Randomly Generated Tagline:
Let's start by ... spelling the word correctly...   - Roxanne Tisch




RE: Bayes question

2004-12-20 Thread Steve Bondy
Just because you learn something as spam doesn't mean it will be
blocked.
SA will add a score to the message based on the bayes rules, but if the
bayes rules are the only ones that get hit, and they score less than
your threshold, it won't keep the stuff out.
For example, the default score in 2.6.x for BAYES_90 is either 2.454 or
2.101.  If that's the only rule you hit, and your threshold is above
those numbers, it will come through.

 -Original Message-
 From: Chuck Campbell [mailto:[EMAIL PROTECTED] 
 Sent: Monday, December 20, 2004 12:02 PM
 To: SpamAssassin Users
 Subject: Bayes question
 
 
 Lately I've been seeing lots of very similar spams get 
 through my 2.6.3.  I don't run autolearn, but I save my spam 
 and ham daily, and run them through sa-learn --spam and --ham 
 respectively.
 
 I'm puzzled why a spam I've manually learned via sa-learn 
 keeps coming through.
 
 What can I check to ensure things are working properly?
 
 BTW, I know I should upgrade, but time isn't available right 
 now, and this setup is catching more than 99.5 percent of the 
 spam coming in.  I'm just curious about bayes not working as 
 expected any longer, although it still catches LOTS of 
 others, so that can't be it completely...
 
 baffled,
 -chuck
 
 


Re: Bayes question

2004-12-20 Thread Chuck Campbell
On Mon, Dec 20, 2004 at 12:56:43PM -0600, Steve Bondy wrote:
 Just because you learn something as spam doesn't mean it will be
 blocked.
 SA will add a score to the message based on the bayes rules, but if the
 bayes rules are the only ones that get hit, and they score less than
 your threshold, it won't keep the stuff out.
 For example, the default score in 2.6.x for BAYES_90 is either 2.454 or
 2.101.  If that's the only rule you hit, and your threshold is above
 those numbers, it will come through.
 

But what if you repeatedly learn the message(s) in question as spam?  
Shouldn't bayes start to give it higher scores?  If it becomes a near perfect 
match, it should get a bayes_99, right?

-chuck




RE: Bayes question

2004-12-20 Thread Steve Bondy
I'm no expert on Bayes, but as far as I know, repeatedly learning the
same message over and over again doesn't do you any good.  Once the
tokens are in there, that's it.  The bayes score goes up as more tokens
in the message match 

Someone please correct me if I'm wrong, and confirm if I'm right... It
would help me out too.

Steve

 -Original Message-
 From: Chuck Campbell [mailto:[EMAIL PROTECTED] 
 Sent: Monday, December 20, 2004 3:54 PM
 To: Steve Bondy
 Cc: SpamAssassin Users
 Subject: Re: Bayes question
 
 
 On Mon, Dec 20, 2004 at 12:56:43PM -0600, Steve Bondy wrote:
  Just because you learn something as spam doesn't mean it will be 
  blocked. SA will add a score to the message based on the 
 bayes rules, 
  but if the bayes rules are the only ones that get hit, and 
 they score 
  less than your threshold, it won't keep the stuff out.
  For example, the default score in 2.6.x for BAYES_90 is 
 either 2.454 or
  2.101.  If that's the only rule you hit, and your threshold is above
  those numbers, it will come through.
  
 
 But what if you repeatedly learn the message(s) in question as spam?  
 Shouldn't bayes start to give it higher scores?  If it 
 becomes a near perfect 
 match, it should get a bayes_99, right?
 
 -chuck
 
 
 


Re: Bayes question

2004-12-20 Thread Chuck Campbell
On Mon, Dec 20, 2004 at 04:13:44PM -0600, Steve Bondy wrote:
 I'm no expert on Bayes, but as far as I know, repeatedly learning the
 same message over and over again doesn't do you any good.  Once the
 tokens are in there, that's it.  The bayes score goes up as more tokens
 in the message match 

It's not the same message... exactly.  It is the same spam, coming from many
different senders, each with a unique message ID.  I keep getting more of them,
and I keep learning them with sa-learn.

I'm just not getting SA to notice them as spam.

-chuck



Re: Bayes question

2004-12-20 Thread Michael Parker
On Mon, Dec 20, 2004 at 04:18:58PM -0600, Chuck Campbell wrote:
 It's not the same message... exactly.  It is the same spam, coming from many
 different senders, each with a unique message ID.  I keep getting more of 
 them,
 and I keep learning them with sa-learn.
 
 I'm just not getting SA to notice them as spam.
 

What rules are hitting?  Is BAYES_99 one of them?

Michael




RE: Bayes question

2004-12-20 Thread Steve Bondy

 
 On Mon, Dec 20, 2004 at 04:13:44PM -0600, Steve Bondy wrote:
  I'm no expert on Bayes, but as far as I know, repeatedly 
 learning the 
  same message over and over again doesn't do you any good.  Once the 
  tokens are in there, that's it.  The bayes score goes up as more 
  tokens in the message match
 
 It's not the same message... exactly.  It is the same spam, 
 coming from many different senders, each with a unique 
 message ID.  I keep getting more of them, and I keep learning 
 them with sa-learn.
 
 I'm just not getting SA to notice them as spam.
 
 -chuck
 
 

So the message content is the same, but coming from different sources?


RE: Bayes question

2004-12-06 Thread Gray, Richard





 So, what happens when you take these two overlapping databases and
 combine them is that certain tokens (those that have overlap) are then
 double counted.  This makes the database, at least according to the
 bayes model SA is using, statistically invalid.

Using this reasoning, the tokens that overlap are going to be
identified as being related to the same message based on the same hashes.
Therefore it should be possible to detect the tokens that are being double
counted, and to dismiss them when they are.
If you can do this then surely the database remains
statistically correct and can be safely merged?










Re: Bayes question

2004-12-05 Thread Ricardo Oliveira
Michael,

 I understood the dangers behind the theory - I'll get into the
analysis of all the bayes databases later on.

 I guess the only way to do it cleanly is to feed the same HAM+SPAM
messages to all the bayes learning mechanisms...

Thanks for your time,
 Ricardo


Re: Bayes question

2004-12-04 Thread Ricardo Oliveira
According to the docs, --restore is destructive (in the sense it
destroys the previous contents of the database).

Would you guys be interested in such a feature? I plan to use a
generic bayes DB (which is maintained by our tech team), and merge it
with each client's own DB (which would result in a highly accurate,
well-trained bayes mechanism). Anyone care to share your thoughts on
this?

TIA,
 Ricardo


Re: Bayes question

2004-12-04 Thread Michael Parker
On Sat, Dec 04, 2004 at 10:46:22AM +, Ricardo Oliveira wrote:
 According to the docs, --restore is destructive (in the sense it
 destroys the previous contents of the database).
 
 Would you guys be interested in such a feature? I plan to use a
 generic bayes DB (which is maintained by our tech team), and merge it
 with each clients's own DB (which would result in a highly accurate,
 well-trained bayes mechanism). Anyone care to share your thoughts on
 this?

No, this is not a good idea, please don't make a tool like this
generally available, here is the reason:

When you learn tokens from a message those tokens are added to the
database, or if they already exist their counts are increased, either
as spam or ham depending on how you are learning.  At the same time a
notation is made that you learned that message by storing, in later
versions, a pseudo message id (it's basically the SHA1 hash of several
pieces of data that should be unique) so that bayes will not re-learn
the tokens from that message.

When you take two different bayes databases that have been learning
separately for any length of time you are bound to have overlap in the
messages they learned.  Everyone gets the same spam and if the
database is from someone you do business with, have a relationship with
or share the same interests with, you are bound to have ham overlap as well.

So, what happens when you take these two overlapping databases and
combine them is that certain tokens (those that have overlap) are then
double counted.  This makes the database, at least according to the
bayes model SA is using, statistically invalid.

Now, that being said, let's say you did an analysis and found that the
two databases had no overlap, or at least very little (I have no idea
what very little would mean in this case).  You could probably
convince yourself that the amount of overlap is statistically
insignificant (it's math and statistics, so I'm horrible at it, but
I'd bet some folks on this list could provide a formula).  If you could
do that then you could combine the databases, in which case I leave it
as an exercise to the reader.
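
For the analysis step only, a rough Python sketch of measuring overlap between two sets of seen ids might look like the following. It assumes you have already extracted the pseudo message ids from each seen database into plain lists by some means of your own; the function name and the sample ids are hypothetical. It deliberately does not say what fraction counts as "very little", and it does no merging.

```python
def overlap_fraction(seen_a, seen_b):
    """Fraction of the smaller database's seen ids that also appear in the other."""
    a, b = set(seen_a), set(seen_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller


# Hypothetical ids standing in for the stored SHA1 pseudo message ids.
generic_db = ["m1", "m2", "m3", "m4"]
client_db = ["m3", "m4", "m5"]

if __name__ == "__main__":
    print(overlap_fraction(generic_db, client_db))
```

Measuring against the smaller database is a choice made for this sketch; dividing by the union or by the larger database would give different numbers for the same data.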

When calculating overlap it is VERY important to remember this: the
pseudo message ids that are stored in the seen database changed
in the middle of the 3.0 development cycle.  So, if you used bayes in
SA in a version < 3.0 you will have mixed message ids in your
database.  In this case it may be difficult to determine how much
overlap your databases have.

If you do write such a tool, I ask that you not make it available.
There are several issues that someone attempting this should study
carefully, and a simple tool makes it too easy to ignore those issues,
which could lead to a broken bayes database in the end.

Michael




Re: Bayes question

2004-12-03 Thread Ricardo Oliveira
What about joining several databases together? 

I'd like to use a general bayes DB, and join it with some clients'
particular DBs.

TIA,
 Ricardo


Re: Bayes question

2004-12-03 Thread Mike
On Fri, 3 Dec 2004 19:37:05 +, Ricardo Oliveira
[EMAIL PROTECTED] wrote:
 What about joining several databases together?
 
 I'd like to use a general bayes DB, and join it with some clients's
 particular DB's.
 
 TIA,
  Ricardo
 


Never tried it, but it should be possible with sa-learn --backup and
sa-learn --restore.

Mike


Re: Bayes question

2004-12-02 Thread Ricardo Oliveira
By the way - are the bayes databases on disk portable (in the sense I
could import or copy them to another server and use them
accordingly)?

Thanks in advance


Re: Bayes question

2004-12-02 Thread Mike
On Thu, 2 Dec 2004 22:27:05 +, Ricardo Oliveira
[EMAIL PROTECTED] wrote:
 By the way - are the bayes databases on disk portable (in the sense I
 could import or copy them to another server and use them
 accordingly)?
 
 Thanks in advance
 

I haven't had a problem doing that, moving from one Sparc to another. 

Mike


Re: Bayes question

2004-11-24 Thread Rakesh
Austin Weidner wrote:
Really trying to figure out bayes. Auto learn is set up, and my headers are
showing autolearn=spam
However, when I do sa-learn --dump magic, there are zero spams and zero
hams.
By using the -D (debug) option, I can see sa-learn is looking at:
debug: bayes: 17216 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
debug: bayes: 17216 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
When I get a new spam, these files are NOT being updated. The files being
updated are in:
/var/spool/mqueue/.spamassassin
How do I sort this out? Autolearn seems to be feeding the files in the
mqueue directory, but sa-learn (and therefore I would think spamassassin
itself) wants it in /root/.spamassassin
This is a MailScanner/SA installation. I've tried to set the path in the
spam.assassin.prefs.conf file to:
bayes_path /root/.spamassassin/bayes
bayes_file_mode 0660
But this didn't do anything. In fact, when I did this, autolearn=spam
stopped showing up in headers.
Any ideas?
 

Did you create a symlink from local.cf in /etc/mail/spamassassin to your 
spam.assassin.prefs.conf? Whichever bayes path you set in local.cf, 
SpamAssassin will follow that path.

--
Regards, 
Rakesh B. Pal
Emergic CleanMail Team.
Netcore Solutions Pvt. Ltd.

==
perl -emap{y/a-z/l-za-k/;print}shift Jjhi pcdiwtg Ptga wprztg,
==

--
Netcore's New Website
http://www.netcore.co.in
--


Re: Bayes question

2004-11-23 Thread Matt Kettler
At 01:58 AM 11/23/2004 -0500, Austin Weidner wrote:
Really trying to figure out bayes. Auto learn is set up, and my headers are
showing autolearn=spam
However, when I do sa-learn --dump magic, there are zero spams and zero
hams.
By using the -D (debug) option, I can see sa-learn is looking at:
debug: bayes: 17216 tie-ing to DB file R/O /root/.spamassassin/bayes_toks
debug: bayes: 17216 tie-ing to DB file R/O /root/.spamassassin/bayes_seen
When I get a new spam, these files are NOT being updated. The files being
updated are in:
/var/spool/mqueue/.spamassassin
What's happening is that your mail is being processed as a non-root user, 
probably something like mail, sendmail, or some similar user that has 
mqueue as its homedir. You can look at the owner of the bayes files to see 
what user it is running as.
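
A plain ls -l on that directory will show the owner. If you'd rather script it, here is a small Python sketch that does the same thing; the function names are my own, and the directory is simply the one from your message.

```python
import os
import pwd


def file_owner(path):
    """Return the login name that owns path, or the numeric uid as text."""
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)  # uid has no passwd entry


def bayes_file_owners(directory):
    """Map each bayes* file in directory to the user that owns it."""
    return {
        name: file_owner(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if name.startswith("bayes")
    }


if __name__ == "__main__":
    bayes_dir = "/var/spool/mqueue/.spamassassin"  # path from the message above
    if os.path.isdir(bayes_dir) and os.access(bayes_dir, os.R_OK | os.X_OK):
        for name, owner in bayes_file_owners(bayes_dir).items():
            print(name, owner)
```

Whatever user it prints is the one your mail is being scanned as.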

Probably the best option is to use sa-learn parameters to tell it where the 
db is.

sa-learn --dbpath /var/spool/mqueue/.spamassassin
also be sure to chown those files back to their original owner when you're done