Re: Training Q

2008-01-17 Thread Matthias Haegele

John D. Hardin wrote:

On Wed, 16 Jan 2008 [EMAIL PROTECTED] wrote:


So, all 3 categories include emails that SA has already seen and
presumably included in its Bayesian filters,


Only if you have autolearn enabled. Can we assume that you do from 
this question? You didn't explicitly say.



and emails that it has never seen.

My question is, should I write a program to take out emails that
SA has already seen before I send them through Bayesian
processing, or is it smart enough not to process those again?


sa-learn won't re-learn messages it has already seen unless you change
their classification (e.g. was ham, re-learn as spam). Don't worry
about it.

In addition, keeping a full corpus around helps re-learning from
scratch should you ever need to do so.


Some people advise against relearning old spam. What would you suggest:
learn only the last 6 months, for example?

--
Gruesse/Greetings
MH


Dont send mail to: [EMAIL PROTECTED]
--



Re: Training Q

2008-01-17 Thread Matthias Haegele

Matthias Haegele wrote:

Some people advise against relearning old spam. What would you suggest:
learn only the last 6 months, for example?


I meant: if you had to relearn from scratch, how far back would you go?


--
Gruesse/Greetings
MH


Dont send mail to: [EMAIL PROTECTED]
--



Re: Training Q

2008-01-17 Thread Loren Wilton

Some people advise against relearning old spam. What would you suggest:
learn only the last 6 months, for example?


I'd suggest only the last 3 months or less of spam if you have enough.  Old 
ham should be fine though.


   Loren
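One way to apply the "last 3 months of spam" advice is to select messages by age before handing them to sa-learn. The following is only a minimal sketch: it assumes one message per file (Maildir-style) and uses the file modification time as a stand-in for message age, which may not hold if the corpus has been copied around. The function name recent_messages and the 90-day default are illustrative, not part of SpamAssassin.

```python
import os
import time

SECONDS_PER_DAY = 86400


def recent_messages(directory, max_age_days=90, now=None):
    """Return paths of regular files in `directory` modified within max_age_days.

    Assumption: one message per file, and mtime approximates message age.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    selected = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        # Skip subdirectories and anything older than the cutoff.
        if os.path.isfile(path) and os.path.getmtime(path) >= cutoff:
            selected.append(path)
    return selected
```

The resulting list could then be passed as file arguments to sa-learn --spam, while the ham corpus is learned in full.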




Problem with sa-learn and virtual user

2008-01-17 Thread Jean-Edouard Babin
Hi,
My mail system uses virtual users.
I run spamd like this:
spamd --virtual-config-dir=/srv/spamassassin/%d/%l -x -u dovecot -c -i
127.0.0.1 -d -r /var/run/spamd.pid
and I run spamc with:
/usr/pkg/bin/spamc -u ${recipient} -f -e ...

This works fine; each user can use their own SA config.

But I would like to be able to run sa-learn for specific users.

I tried sa-learn --username [EMAIL PROTECTED] --spam files

But as I can see with debug (-D), it uses the Bayes files of the Unix user
running the command.

So I'm wondering what I should do. In particular, I'm wondering how sa-learn
can know that it should use /srv/spamassassin/%d/%l for the Bayes files, but
I did not find the answer.


Re: A rule to match patterns on recipient name.

2008-01-17 Thread Steve

Loren Wilton wrote:
Valid email addresses have a well-known structure (i.e. [A-z.]*_NAME) 
so, for example [EMAIL PROTECTED] is clearly a bogus address.


Off the top of my head you might be able to do something like (untested):

header   __GOOD_NAME   To =~ /[A-Za-z]{1,30}_[A-Za-z\d\.]{2,40}\@(?i:domain\.com)/

meta     BAD_NAME      !__GOOD_NAME
score    BAD_NAME      2

Above is based on the assumption that NAME includes only letters, 
numbers, and dots.  If it can also have underscores then you could 
just do \w{2,40} or the like for the second part.

Hmmm - not a bad start, I guess.  If I were to put something like this 
in individual users' .spamassassin/user_prefs - then I could be even 
more restrictive about NAME.  I am concerned, however, that this might 
not cope well with mailing lists (where To is the mailing list name) or 
in circumstances where the user is CC'd rather than addressed directly.





Re: Problem with sa-learn and virtual user

2008-01-17 Thread Jonathan Armitage

Jean-Edouard Babin wrote:

On Jan 17, 2008 1:38 PM, Jonathan Armitage [EMAIL PROTECTED]
wrote:


Jean-Edouard Babin wrote:

But I would like to be able to run sa-learn for specific users.

I tried sa-learn --username [EMAIL PROTECTED] --spam files

But as I can see with debug (-D), it uses the Bayes files of the Unix user
running the command.


Try su - username -c "sa-learn --spam spamdir"

Jon



The users are virtual.


I think there is another way, but can't remember offhand.

Look back through the mailing list. It was discussed a few months ago.

Jon


Re: Problem with sa-learn and virtual user

2008-01-17 Thread Matt Kettler

Jean-Edouard Babin wrote:

Hi,

My mail system uses virtual users.
I run spamd like this:
spamd --virtual-config-dir=/srv/spamassassin/%d/%l -x -u dovecot -c -i 
127.0.0.1 -d -r /var/run/spamd.pid 
and I run spamc with:

/usr/pkg/bin/spamc -u ${recipient} -f -e ...

This works fine; each user can use their own SA config.

But I would like to be able to run sa-learn for specific users.

I tried sa-learn --username [EMAIL PROTECTED] --spam files


But as I can see with debug (-D), it uses the Bayes files of the Unix user 
running the command.

Yes, the --username option in sa-learn *ONLY* works with SQL backends, as per 
the docs.


So I'm wondering what I should do. In particular, I'm wondering how sa-learn 
can know that it should use /srv/spamassassin/%d/%l for the Bayes files, but 
I did not find the answer.


You'd have to use the --dbpath option to override SA's default idea of 
where the Bayes database lives.
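As a rough illustration of the --dbpath approach, a small wrapper could expand the same %d/%l template that spamd uses and build the sa-learn command line for one virtual user. This is a sketch under assumptions: the helper names are hypothetical, the address joe@example.com is a placeholder, and the "bayes" file prefix inside the per-user directory is assumed by analogy with SA's default layout, so check where spamd actually writes the files before relying on it.

```python
import posixpath
import re


def virtual_user_dir(address, template="/srv/spamassassin/%d/%l"):
    """Expand %d (domain) and %l (local part) in the virtual-config-dir template."""
    local, sep, domain = address.rpartition("@")
    if not (sep and local and domain):
        raise ValueError("expected an address of the form local@domain")
    values = {"%d": domain, "%l": local}
    return re.sub(r"%[dl]", lambda m: values[m.group(0)], template)


def sa_learn_command(address, message_dir, kind="spam"):
    """Build an sa-learn argument list that targets one virtual user's Bayes DB."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    # Assumption: the Bayes files in the user's directory use the prefix "bayes".
    dbpath = posixpath.join(virtual_user_dir(address), "bayes")
    return ["sa-learn", "--dbpath", dbpath, "--" + kind, message_dir]
```

The returned list can be handed to subprocess.run by a script running as the same Unix user as spamd (dovecot in the setup above), so that file ownership stays consistent.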




sa-learn error message

2008-01-17 Thread merikay

Hi again SA experts,

Note the error message in the 2nd-last line of the following transcript:

animalhead:~/sj $ sa-learn --no-rebuild --spam --mbox savejunk
The --no-rebuild option has been deprecated.  Please use --no-sync instead.

Learned tokens from 3025 message(s) (3047 message(s) examined)
animalhead:~/sj $ sa-learn --no-sync --spam thruJunk
bayes: bayes db version 0 is not able to be used, aborting! at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/BayesStore/DBM.pm line 196.

Learned tokens from 170 message(s) (170 message(s) examined)

There are 171 messages in directory thruJunk.  The largest is 495K,  
the next largest is 137K.

$ sa-learn -V    yields    spamassassin v 3.2.1

What should I do about this?

I still have another directory with ham to go.  It includes lots of  
large files.  Should I delete those over a certain size?


Thanks,
Craig MacKenna



Re: A rule to match patterns on recipient name.

2008-01-17 Thread Steve Haeck

Bowie Bailey wrote:

Catch-all setups always have this problem.  You could use SA to figure
out which addresses are likely to be valid, but this means that you have
to accept the message and then call SA for EVERY one of these emails.
  

I'm aware of that... but the benefits outweigh the problems.

The best way is to use your MTA.  Set up a method for your users to
create these email addresses as real email aliases in your MTA.  Then
you can set your MTA to only accept valid email addresses and the
problem goes away.
  
That would be a problem - since there is no definitive list of 'valid' 
email addresses - however I do know the form of all valid email 
addresses.  If I could replace list-based lookup with a function to 
parse and validate email addresses with my MTA, I'd be laughing.


It's no big problem to process every spam - but it would be desirable to 
at least mark all these made-up email address destinations with a higher 
spam score.  Is there an existing rule I can customise, or would I have 
to start from scratch?






RE: A rule to match patterns on recipient name.

2008-01-17 Thread Bowie Bailey
Steve wrote:
 Loren Wilton wrote:
   Valid email addresses have a well-known structure (i.e.
   [A-z.]*_NAME) so, for example [EMAIL PROTECTED] is clearly a
   bogus address. 
  
  Off the top of my head you might be able to do something like
  (untested): 
  
  header   __GOOD_NAME   To =~ /[A-Za-z]{1,30}_[A-Za-z\d\.]{2,40}\@(?i:domain\.com)/
  meta     BAD_NAME      !__GOOD_NAME
  score    BAD_NAME      2
  
  Above is based on the assumption that NAME includes only letters,
  numbers, and dots.  If it can also have underscores then you could
  just do \w{2,40} or the like for the second part.
 
 Hmmm - not a bad start, I guess.  If I were to put something like this
 in individual users' .spamassassin/user_prefs - then I could be even
 more restrictive about NAME.  I am concerned, however, that this might
 not cope well with mailing lists (where To is the mailing list name)
 or in circumstances where the user is CC'd rather than addressed
 directly. 

That can be fixed by having the MTA (or MDA) add a Delivered-To header
indicating the user the message is being delivered to.  Then you can use
this header rather than having to rely on something sensible being in
the To or Cc headers.

-- 
Bowie
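To make the Delivered-To idea concrete, here is a small sketch that checks the delivery header instead of To or Cc. It assumes the MTA or MDA has already added a Delivered-To header containing just the bare recipient address, reuses the pattern suggested earlier in the thread with domain.com as a placeholder, and anchors it with fullmatch, which is a tightening not present in the original rule. The function name is hypothetical.

```python
import re
from email import message_from_string

# Pattern from the thread; only the domain part is case-insensitive.
NAME_PATTERN = re.compile(r"[A-Za-z]{1,30}_[A-Za-z\d.]{2,40}@(?i:domain\.com)")


def delivered_to_looks_valid(raw_message):
    """Return True if any Delivered-To header matches the expected address form."""
    msg = message_from_string(raw_message)
    recipients = msg.get_all("Delivered-To", [])
    # Assumption: each header value is a bare address with no display name.
    return any(NAME_PATTERN.fullmatch(r.strip()) for r in recipients)
```

Because the check ignores To and Cc entirely, a message sent through a mailing list or via Bcc is still judged on the address it was actually delivered to.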


RE: A rule to match patterns on recipient name.

2008-01-17 Thread Bowie Bailey
Steve Haeck wrote:
 Bowie Bailey wrote:
  
  The best way is to use your MTA.  Set up a method for your users to
  create these email addresses as real email aliases in your MTA. 
  Then you can set your MTA to only accept valid email addresses and
  the problem goes away. 
 
 That would be a problem - since there is no definitive list of 'valid'
 email addresses - however I do know the form of all valid email
 addresses.  If I could replace list-based lookup with a function to
 parse and validate email addresses with my MTA, I'd be laughing.

That should be possible.  The amount of work necessary will depend on
your MTA.  With Courier, you can write a filter module in Perl or PHP.
I don't know about the others.

 It's no big problem to process every spam - but it would be desirable
 to at least mark all these made-up email address destinations with a
 higher spam score.  Is there an existing rule I can customise, or
 would I have to start from scratch?

The problems with processing every spam tend to rise with the amount of
junkmail you receive.  I had to rework my mail setup here a few years
ago when the amount of spam coming in was more than my system could deal
with.  Once my front-line mail server could reject invalid email
addresses, my mail volume dropped way down and allowed my servers to
keep up without a spam blast causing massive delays.

There is no existing rule, but writing one shouldn't be difficult.  One
suggestion has already been given.  The main hassle will be with mailing
lists and such which send via BCC.

-- 
Bowie


Re: A rule to match patterns on recipient name.

2008-01-17 Thread mouss

Steve wrote:

Bowie Bailey wrote:

That can be fixed by having the MTA (or MDA) add a Delivered-To header
indicating the user the message is being delivered to.  Then you can use
this header rather than having to rely on something sensible being in
the To or Cc headers.
I always wondered where Delivered-To was added - and why some 
messages I've seen have it and others don't.


Time to break out the postfix manual... :-)


If delivering via a pipe, set the 'D' flag.

Note that you can configure Postfix to reject invalid addresses 
instead of doing this in SA.


An alternative to your scheme is to use address extensions. For example, 
using '-' as the delimiter ('+' is refused by many sites), each user can 
receive mail for [EMAIL PROTECTED]


You can also give each user two addresses, say steve.haeck and 
steve. The first is used privately; the second is always used with 
an extension. That way you can reject mail to [EMAIL PROTECTED] and accept 
[EMAIL PROTECTED]
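The two-address scheme can be sketched as a simple acceptance check on the local part. The addresses in the message above are redacted, so the reading used here (bare "steve" is rejected, "steve-anything" is accepted, "steve.haeck" is always accepted) is an interpretation, and the function and parameter names are hypothetical. In practice this decision would live in the MTA configuration rather than in a script.

```python
def accept_recipient(local_part, private_names, extension_bases, delimiter="-"):
    """Decide whether to accept mail for a local part under the two-address scheme.

    private_names: local parts accepted as-is (e.g. "steve.haeck").
    extension_bases: local parts accepted only with a non-empty extension.
    """
    if local_part in private_names:
        return True
    # Split on the first delimiter: "steve-shop" -> ("steve", "-", "shop").
    base, sep, extension = local_part.partition(delimiter)
    return bool(sep) and base in extension_bases and extension != ""
```

With private_names={"steve.haeck"} and extension_bases={"steve"}, mail to steve-shop is accepted while mail to plain steve, which is the form a spammer would guess, is refused.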





Re: A rule to match patterns on recipient name.

2008-01-17 Thread Steve

Bowie Bailey wrote:

That can be fixed by having the MTA (or MDA) add a Delivered-To header
indicating the user the message is being delivered to.  Then you can use
this header rather than having to rely on something sensible being in
the To or Cc headers.
I always wondered where Delivered-To was added - and why some messages 
I've seen have it and others don't.


Time to break out the postfix manual... :-)

Thanks,
Steve




Re: Problem with sa-learn and virtual user

2008-01-17 Thread Jean-Edouard Babin
On Jan 17, 2008 2:31 PM, Matt Kettler [EMAIL PROTECTED] wrote:

 Jean-Edouard Babin wrote:
  Hi,
 
  My mail system uses virtual users.
  I run spamd like this:
  spamd --virtual-config-dir=/srv/spamassassin/%d/%l -x -u dovecot -c -i
  127.0.0.1 -d -r /var/run/spamd.pid
  and I run spamc with:
  /usr/pkg/bin/spamc -u ${recipient} -f -e ...
 
  This works fine; each user can use their own SA config.
 
  But I would like to be able to run sa-learn for specific users.
 
  I tried sa-learn --username [EMAIL PROTECTED] --spam files
 
  But as I can see with debug (-D), it uses the Bayes files of the Unix user
  running the command.
 Yes, the --username in sa-learn *ONLY* works with SQL backends, as per
 the docs.


Yes, thanks, that's what I see in the docs after upgrading.


 
  So I'm wondering what I should do. In particular, I'm wondering how sa-learn
  can know that it should use /srv/spamassassin/%d/%l for the Bayes files, but
  I did not find the answer.
 
 you'd have to use the --dbpath option to over-ride SA's default idea of
 where the bayes database lives.


OK, -p seems to work fine too, but it is of course a less automatic solution.

thanks,


Re: A rule to match patterns on recipient name.

2008-01-17 Thread Loren Wilton
header   __GOOD_NAME   To =~ /[A-Za-z]{1,30}_[A-Za-z\d\.]{2,40}\@(?i:domain\.com)/

meta     BAD_NAME      !__GOOD_NAME
score    BAD_NAME      2

Above is based on the assumption that NAME includes only letters, 
numbers, and dots.  If it can also have underscores then you could just 
do \w{2,40} or the like for the second part.


Hmmm - not a bad start, I guess.  If I were to put something like this in 
individual users' .spamassassin/user_prefs - then I could be even more 
restrictive about NAME.  I am concerned, however, that this might not cope 
well with mailing lists (where To is the mailing list name) or in 
circumstances where the user is CC'd rather than addressed directly.


It will surely fail on mailing lists and Bcc items, which is why I gave it a 
relatively low score.
You had seemingly specifically said To previously.  You can use ToCc in 
place of To in the rule and catch both To and CC.


   Loren
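The behaviour of the rule pair, including the mailing-list failure mode, can be tried outside SpamAssassin with a short script. This is only an approximation of what SA does: it collects To and Cc addresses (roughly what the ToCc pseudo-header covers), applies the same unanchored pattern with domain.com as a placeholder, and returns the score 2 when nothing matches. The function name is hypothetical.

```python
import re
from email import message_from_string
from email.utils import getaddresses

# Same pattern as __GOOD_NAME; only the domain part is case-insensitive.
GOOD_NAME = re.compile(r"[A-Za-z]{1,30}_[A-Za-z\d.]{2,40}@(?i:domain\.com)")


def bad_name_score(raw_message, score=2.0):
    """Return `score` if no To/Cc address has the expected form, else 0.0."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    addresses = [addr for _name, addr in getaddresses(headers)]
    good = any(GOOD_NAME.search(addr) for addr in addresses)
    return 0.0 if good else score
```

A message addressed only to a list address gets the full score even though it is legitimate, which is the reason for keeping the score low.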




Re: How to skip checking emails over a certain size?

2008-01-17 Thread Henry Kwan
Theo Van Dinter felicity at apache.org writes:

  
  spamd[2492]: razor2: razor2 check failed: razor2: razor2 had unknown error
  during check at
  /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/Plugin/Razor2.pm line 211,
  GEN25 line 1. at
  /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/Plugin/Razor2.pm line 326. 
 
 Run spamd w/ -D (or -D razor2 for more output) and find out if there's
 any actual error messages for one of the problematic messages.
 

Hi,

Apparently, it was unrelated to the size of the email.  It was some type of
registration error with razor2.  If I did a razor-admin -register, it would
abort with an Error 202.

I couldn't figure out why this was happening, but installing the latest version
(2.84) over my older razor2 install (2.67) seems to have made razor-admin
-register work again.

I am still getting these "razor2 had unknown error" entries, but instead of a dozen
or so each hour, it's more like 1 every few hours now.  The emails are
properly getting tagged with RAZOR2_CHECK, though, so I guess it's working OK.

Thanks.





SA: failed to run header tests, skipping some.

2008-01-17 Thread Michael Hutchinson
Hello everyone, 

 

I have been having some issues with Spamassassin and have been ironing
things out (like child processes not becoming re-usable), but there is
one that floors me (probably because I'm not a perl expert.. but hey).
Anyway here goes my configuration and the errors I am seeing, I hope
someone with a kind heart can help me out.

 

Mail server information:

 

Operating System is Debian 3.1 (Sarge)

MTA is Qmail (as per Shupp's toaster)

Spamassassin is version 3.1.4

ClamAV is version 0.92

Perl is version 5.8.4

Spamassassin invocation is by init.d :

spamd -q -x -m 5 -H -d --pidfile=/var/run/spamd.pid

 

OK, so now for the errors that I am getting for each email received by
our mail server, from /var/log/mail.warn:

 

Jan 18 10:14:24 tuatara spamd[18169]: Number found where operator expected at (eval 878) line 10, near }
Jan 18 10:14:24 tuatara spamd[18169]:
Jan 18 10:14:24 tuatara spamd[18169]:  1
Jan 18 10:14:24 tuatara spamd[18169]:  (Missing operator before
Jan 18 10:14:24 tuatara spamd[18169]:
Jan 18 10:14:24 tuatara spamd[18169]:  1?)
Jan 18 10:14:24 tuatara spamd[18169]: rules: failed to run header tests, skipping some: syntax error at (eval 878) line 11, near ;
Jan 18 10:14:24 tuatara spamd[18169]: }
Jan 18 10:14:24 tuatara spamd[18169]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/Mail/SpamAssassin/PerMsgStatus.pm line 2656, GEN10 line 28.
Jan 18 10:14:24 tuatara spamd[18169]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/Mail/SpamAssassin/PerMsgStatus.pm line 2657, GEN10 line 28.
Jan 18 10:14:24 tuatara last message repeated 2 times
Jan 18 10:14:24 tuatara spamd[18169]: Number found where operator expected at (eval 879) line 10, near }
Jan 18 10:14:24 tuatara spamd[18169]:
Jan 18 10:14:24 tuatara spamd[18169]:  1
Jan 18 10:14:24 tuatara spamd[18169]:  (Missing operator before
Jan 18 10:14:24 tuatara spamd[18169]:
Jan 18 10:14:24 tuatara spamd[18169]:  1?)
Jan 18 10:14:24 tuatara spamd[18169]: rules: failed to run header tests, skipping some: syntax error at (eval 879) line 11, near ;
Jan 18 10:14:24 tuatara spamd[18169]: }
Jan 18 10:14:24 tuatara spamd[18169]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/Mail/SpamAssassin/PerMsgStatus.pm line 2656, GEN10 line 28.
Jan 18 10:14:24 tuatara spamd[18169]: Use of uninitialized value in concatenation (.) or string at /usr/share/perl5/Mail/SpamAssassin/PerMsgStatus.pm line 2657, GEN10 line 28.
Jan 18 10:14:24 tuatara last message repeated 2 times

 

I am more than happy to attach my PerMsgStatus.pm if anyone would like
to peruse it.

I have attempted to find the problem in this file, but don't understand
it enough, or the problem is not actually in there.

Any help anyone can give me would be truly appreciated!

 

Cheers,

Michael Hutchinson

Linux Systems Administrator

Manux Solutions Ltd

[EMAIL PROTECTED]

http://www.manux.co.nz  

 



Re: sa-learn error message

2008-01-17 Thread Steven Stern

Theo Van Dinter wrote:

On Thu, Jan 17, 2008 at 03:28:06PM -0600, Steven Stern wrote:

bayes db version 0  indicates your bayes file is corrupt. It should be
version 3.  Do you have a backup?  SQL or .db?


It doesn't necessarily mean there's corruption,
in fact, since the learning continued and seemed
to finish ok, it's unlikely to be corruption.  See
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3563 for a possible
libdb issue which causes it.



Thanks. I ran into this when I hosed the sa_bayes MySQL database as we 
were cloning one of our MX servers.


Re: sa-learn error message

2008-01-17 Thread mackenna

Thank you to both responders.

Did I read something that said that the digit after "bayes db  
version" indicated the version of Berkeley DB that's installed on the  
system?  Like 0 means 1.x...   Google shows various messages like  
"bayes db version 2 is not able to be used, aborting!" which would  
seem to indicate that 0 is not indicative of the problem I saw.


Perhaps the reason that the bug report lists 0 is that Berkeley DB  
version 1.x does not include an integrated locking mechanism, but  
higher versions are reputed to have such a mechanism.


Before I got your responses, I got my courage up and went on to run

$ sa-learn --no-sync --ham ham  (ham is the name of a directory  
in the current working directory)

$ sa-learn --sync

and everything went well.  This sort of indicates that the DB isn't  
hopelessly corrupt.


Please, where is this DB that I should back up?

I wrote a GP Berkeley DB rebuilding program that reads all of the key/ 
value pairs in a DB, and writes all of those for which the key and  
value are defined and of non-zero length, to a new DB.  I could try  
running that and see if the new DB is significantly smaller than the  
old, which for my DBs indicates that it's time to use the new DB.


Theo, do you know if SA uses any entries with null keys or values,  
that are needed for proper operation?  It would be easy to keep  
entries with null values; I wrote the program to discard them because  
my DBs don't use such entries.


Thanks,
Craig MacKenna


On Jan 17, 2008, at 1:50 PM, Theo Van Dinter wrote:


On Thu, Jan 17, 2008 at 03:28:06PM -0600, Steven Stern wrote:
bayes db version 0  indicates your bayes file is corrupt. It should be
version 3.  Do you have a backup?  SQL or .db?


It doesn't necessarily mean there's corruption,
in fact, since the learning continued and seemed
to finish ok, it's unlikely to be corruption.  See
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3563 for a possible
libdb issue which causes it.

--
Randomly Selected Tagline:
And the No. 1 response that you'll need to memorize if you plan to  
bet

 your business on Windows 2000: 'You want fries with that?'
 - Nicholas Petreley




Re: sa-learn error message

2008-01-17 Thread Theo Van Dinter
On Thu, Jan 17, 2008 at 07:42:30PM -0800, [EMAIL PROTECTED] wrote:
 Did I read something that said that the digit after bayes db  
 version indicated the version of Berkeley DB that's installed on the  
 system?  Like 0 means 1.x...   Google shows various messages like  
 bayes db version 2 is not able to be used, aborting! which would  
 seem to indicate that 0 is not indicative of the problem I saw.

The bayes version has nothing to do with the version of Berkeley DB.  It's the
version of the Bayes data.  It's been 3 for a while now.  from man sa-learn:

 The database ’version number’ is 0 for databases from 2.5x, 1 for
 databases from certain 2.6x development releases, and 2 for all more
 recent databases.

Hrm.  Interestingly, it doesn't mention version 3, which was introduced
in 3.0 and has been used in all later versions.  I'll update the man
page in a minute. :)

 Perhaps the reason that the bug report lists 0 is that Berkeley DB  
 version 1.x does not include an integrated locking mechanism, but  
 higher versions are reputed to have such a mechanism.

The DB_File module, used to access the database files, uses the 1.x API, so no
locking from libdb.  SA does locking on its own.  If you're not using NFS, I'd
recommend using lock_method flock, btw.

 Please, where is this DB that I should back up?

It depends what your configuration is.  Typically it's
~/.spamassassin/bayes_toks.  Otherwise, look at the bayes_path setting.

 I wrote a GP Berkeley DB rebuilding program that reads all of the key/ 
 value pairs in a DB, and writes all of those for which the key and  
 value are defined and of non-zero length, to a new DB.  I could try  
 running that and see if the new DB is significantly smaller than the  
 old, which for my DBs indicates that it's time to use the new DB.

This generally isn't needed by SA, since it's what SA does when
an expire happens.  You should also look at 'sa-learn --backup' and
'--restore'.  You could also just use db_dump | db_load.

 Theo, do you know if SA uses any entries with null keys or values,  
 that are needed for proper operation?  It would be easy to keep  
 entries with null values; I wrote the program to discard them because  
 my DBs don't use such entries.

There will be values with null (ascii 0) in them as the token keys are
binary values, and the values are binary packed values.  This is why
sa-learn --backup is a good choice, it will convert the binary into
text.

-- 
Randomly Selected Tagline:
... then you'll excuse me, but I'm in the middle of fifteen things, all of
 them annoying.
 - Ivonova, Babylon 5 (Midnight on the Firing Line)
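The point about binary data matters for a rebuild program that drops "empty" entries: a packed value made entirely of NUL bytes is still a defined, non-empty value. The sketch below illustrates the distinction with made-up data. The key bytes and the struct layout are purely illustrative and are not SpamAssassin's actual on-disk format; keep_entry is a hypothetical helper.

```python
import struct


def keep_entry(key, value):
    """Keep entries whose key and value are defined and of non-zero length.

    Length is checked on the raw bytes, so NUL bytes inside them still count.
    """
    return key is not None and value is not None and len(key) > 0 and len(value) > 0


# Illustrative data only: a binary key containing NUL bytes and a packed
# value whose counters both happen to be zero.
token_key = b"\x00\x1f\xa3\x00\x07"
packed_value = struct.pack("<II", 0, 0)

# A filter that strips NULs before testing would wrongly treat this as empty.
looks_empty_if_stripped = packed_value.strip(b"\x00") == b""
```

A length test on the raw bytes keeps such entries, whereas any filter that treats NUL as whitespace or padding would silently discard them, which is one more reason to prefer sa-learn --backup for moving the data.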




Multiple per user?

2008-01-17 Thread Rajkumar S
Hi,

I am using SA from amavisd-new, with Postfix, in a before-queue
configuration, with per-user scores (provided by amavisd-new). I also
want to implement per-user Bayes, but since SA takes only a single user
name, amavisd-new is not able to implement it.

Is it a good idea for SA to take multiple user names for a single
mail and run all checks for each of them? For example, if a mail has
3 RCPT TOs, SA would be called with all 3 recipients and would report
back with 3 sets of results. I hope this would be more efficient than
calling SA 3 times inside procmail or another MDA, and it would make
life very easy for everyone running SA at the MTA level.

I am sure other people have thought about this before me. Since
it is not yet implemented, is this:

1. brain dead,
2. too much work, requiring ripping apart SA's guts, or
3. something that just needs someone to hack on it, and can be done relatively easily?

I can do a bit of coding if it's case 3.

with regards,

raj