Re: Which DB is actually used?

2006-09-12 Thread Bo Mellberg



jdow wrote:

From: "Logan Shaw" <[EMAIL PROTECTED]>


On Fri, 8 Sep 2006, Bo Mellberg wrote:
It seems like the exim-users database is being touched regularly, so 
I'm guessing that it has been set up by apt-get in some 
"auto-learning" state.


Yes, you might want to check whatever's running SpamAssassin and
see what user it's running as and also check the configuration
files (probably in /etc/mail/spamassassin) to see where it's
storing the database.

I have earlier trained spam and ham as user "bosse", which is why 
there is a working db there as well.


As I am the only user on my system, it really doesn't matter if I use 
site-wide or not, but rather how I invoke sa-learn.


Let's say I remove the databases for "bosse" and "root". Is this the 
proper way to invoke sa-learn:


1. Log on as user "bosse"
2. sa-learn --showdots --sync --dbpath /var/spool/exim4/.spamassassin 
--spam /home/bosse/Maildir/.MissedSpam/cur


Probably not, or at least not the best way.


Absolutely not. The database under "bosse" is quite apparently not
being used except for his misplaced training. He needs to "su -l exim4"
and then run sa-learn.


I thought that this was what --dbpath was meant for. To tell sa-learn 
what database to actually update. In the case above, the exim DB is 
trained with spam from the "bosse"-user. So IF the exim DB is the one 
used for spam control, it would with the above command be the one 
trained, no?


A better solution is of course to tell SA to use "per user" databases and 
log on as bosse and train normally. I'll do some RTFM and googling to 
see how the setup for Debian is actually made.


/Bo


Re: Which DB is actually used?

2006-09-10 Thread mouss

Bo Mellberg wrote:

jdow wrote:

From: "Bo Mellberg" <[EMAIL PROTECTED]>


I have SA 3.1.4 configured and running on Debian Sarge using apt-get.

I'm finding it hard to know what directory is actually used for the 
bayes-database:


max:~# ls /root/.spamassassin/ -al
total 2344
drwx--  2 root root4096 Sep  8 07:52 .
drwxr-xr-x 12 root root4096 Sep  5 09:37 ..
-rw---  1 root root   12288 Sep  4 14:20 auto-whitelist
-rw-rw-rw-  1 root root   6 Sep  4 14:20 auto-whitelist.mutex
-rw-rw-rw-  1 root root   13992 Sep  4 14:08 bayes.mutex
-rw---  1 root root  344064 Sep  4 14:05 bayes_seen
-rw---  1 root root 2605056 Sep  8 07:52 bayes_toks
-rw-r--r--  1 root root1487 Sep  4 14:20 user_prefs
max:~# ls /home/bosse/.spamassassin/ -al
total 4564
drwx--S--- 2 bosse bosse4096 Sep  7 10:35 .
drwxr-sr-x 5 bosse bosse4096 Aug 31 16:19 ..
-rw--- 1 root  bosse   12288 Sep  6 01:06 auto-whitelist
-rw--- 1 root  bosse   6 Sep  6 01:06 auto-whitelist.mutex
-rw-rw-rw- 1 bosse bosse   15282 Sep  6 01:06 bayes.mutex
-rw--- 1 root  bosse   86136 Sep  6 01:06 bayes_journal
-rw--- 1 bosse bosse  339968 Sep  6 01:06 bayes_seen
-rw--- 1 root  bosse 5255168 Sep  6 01:06 bayes_toks
-rw--- 1 root  bosse1165 Oct  2  2005 user_prefs
max:~# ls /var/spool/exim4/.spamassassin/ -al
total 3424
drwx-- 2 Debian-exim Debian-exim4096 Sep  8 08:04 .
drwxr-x--- 7 Debian-exim Debian-exim4096 Sep  5 15:54 ..
-rw--- 1 Debian-exim Debian-exim 1298432 Sep  8 08:04 auto-whitelist
-rw-rw-rw- 1 Debian-exim Debian-exim   6 Sep  4 14:15 auto-whitelist.mutex
-rw-rw-rw- 1 Debian-exim Debian-exim   6 Sep  4 14:15 bayes.mutex
-rw--- 1 Debian-exim Debian-exim   64704 Sep  8 08:04 bayes_journal
-rw--- 1 Debian-exim Debian-exim  319488 Sep  8 08:04 bayes_seen
-rw--- 1 Debian-exim Debian-exim 2629632 Sep  8 08:04 bayes_toks
-rw-r--r-- 1 Debian-exim Debian-exim1175 Nov  1  2005 user_prefs

As you can see there are three directories which are all quite 
recently changed. How can I make sure that only one directory is used?


I would like to make SA site-wide, but the filtering is working 
really well right now so I'm afraid I'll break something. BTW, the 
user "bosse" is my own account used for my email.


* I just performed sa-learn --sync -D as root.
* I've never touched the exim directory, still it has the latest 
change date.


Thanks in advance.

/Bo


Bo - I can't particularly help you with the single site-wide database
thing. It seems you have a bit of a mishmash that, depending on things
you have done, may actually be acting the way you want it to act. It
looks like you might have played with training or tests as "bosse"
and "root" and otherwise have everything working on the exim4 global
database. Always test and train as the user that is used for filtering
the email by the MTA. Other tests and training are meaningless.

If you do not have many users at all, dozens or less, then do
consider using per user BAYES. It CAN provide the users with a better
anti-spam experience. The reasoning behind this is that one user's
spam is almost always going to be some other user's ham. If you have
hundreds then there might be a good reason for a single BAYES database.
By the time you're into thousands you're using virtual accounts and
a global database may be required. But it won't provide quite the
pinpoint accuracy of a per user database.

{^_^}




Thanks for this info,

It seems like the exim-users database is being touched regularly, so 
I'm guessing that it has been set up by apt-get in some 
"auto-learning" state.


I have earlier trained spam and ham as user "bosse", which is why 
there is a working db there as well.


As I am the only user on my system, it really doesn't matter if I use 
site-wide or not, but rather how I invoke sa-learn.


Let's say I remove the databases for "bosse" and "root". Is this the 
proper way to invoke sa-learn:


1. Log on as user "bosse"
2. sa-learn --showdots --sync --dbpath /var/spool/exim4/.spamassassin 
--spam /home/bosse/Maildir/.MissedSpam/cur


If I set up a cron job to do the above I could just toss missed spam 
into the "MissedSpam"-folder right?


One way is to use a mysql db and have something like this in your 
configuration:


## global bayes db
bayes_sql_override_username spamassassin

then you won't have to worry who runs the filter and who trains it.

see the wiki for how to migrate. if you migrate, connect to your mysql 
and update the user field to match the one used in the configuration 
("spamassassin" above).
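
A fuller version of that fragment might look like the sketch below. Only
bayes_sql_override_username comes from the lines above; the DSN, database
name, account and password are placeholders assumed for illustration, and
the option names are worth checking against the Mail::SpamAssassin::Conf
documentation for the installed version.

```
## global bayes db (sketch; every value below is a placeholder)
bayes_store_module           Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn                DBI:mysql:spamassassin:localhost
bayes_sql_username           sa_user
bayes_sql_password           changeme
bayes_sql_override_username  spamassassin
```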




Re: Which DB is actually used?

2006-09-08 Thread jdow

From: "Logan Shaw" <[EMAIL PROTECTED]>


On Fri, 8 Sep 2006, Bo Mellberg wrote:
It seems like the exim-users database is being touched regularly, so I'm guessing that 
it has been set up by apt-get in some "auto-learning" state.


Yes, you might want to check whatever's running SpamAssassin and
see what user it's running as and also check the configuration
files (probably in /etc/mail/spamassassin) to see where it's
storing the database.

I have earlier trained spam and ham as user "bosse", which is why there is a working db 
there as well.


As I am the only user on my system, it really doesn't matter if I use site-wide or not, 
but rather how I invoke sa-learn.


Let's say I remove the databases for "bosse" and "root". Is this the proper way to 
invoke sa-learn:


1. Log on as user "bosse"
2. sa-learn --showdots --sync --dbpath /var/spool/exim4/.spamassassin --spam 
/home/bosse/Maildir/.MissedSpam/cur


Probably not, or at least not the best way.


Absolutely not. The database under "bosse" is quite apparently not
being used except for his misplaced training. He needs to "su -l exim4"
and then run sa-learn.

(Were it me I'd rip out amavisd-new and put in something that
(IMAO {^,-}) works like procmail. I'm not sure I'd use Exim, either,
unless it can explicitly run spamc as the user "bosse". At the VERY
least I'd read whatever manual existed for amavisd-new and Exim such
that "spamc -u bosse" will work and have spamd access the "bosse"
database. Of course, if spamd is running in a sandbox it can't make
that reach without some skullduggery. So the entire installation needs
to be examined and manipulated so that per user BAYES can work "fer
shure". That's a LOT of RTFM and examine your system configuration,
to be sure. But learning only hurts a little and having learned is
a nice feeling.)


First of all, you need to run sa-learn as the same user that
runs the filtering.  Since you haven't said what user that it
is (whether it's "bosse" or some other user), it's impossible
to say whether that's the correct user to run sa-learn as.


Exactly - and he's not doing that.

If I set up a cron job to do the above I could just toss missed spam into the 
"MissedSpam"-folder right?


Yeah, but for efficiency reasons, you'd probably not want
messages in that folder to keep accumulating forever, so you'd
probably want a way to purge them after some period of time.
sa-learn can cope with a situation where you feed it the same
message repeatedly with no harm, but it's still a waste of
CPU cycles.


I have ham, spam, oldham, and oldspam entries for my learning process
done via IMAP folders. Once a night spam is learned. When the spam or
ham folder gets more than say a dozen entries I move them over to oldspam
or oldham respectively. That way I keep my old learn database around so
I can rebuild it. Of course, I manually train ONLY. There's none of that
silly autolearn happening here. It's too prone to going off in wild
strange new directions orthogonal (at the very least) to good sense.
Again, YMMV and IMAO liberally apply to the above statements.
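
A rough sketch of the folder rotation described above, assuming the folders
are plain directories of message files (for example the cur directory of a
Maildir). The function name, the paths and the default limit of 12 are
hypothetical; the nightly learning step itself is left out.

```shell
#!/bin/sh
# Move every message from a learned folder to its archive folder once
# more than LIMIT messages have piled up there.
#   $1 = folder that has already been learned (e.g. spam)
#   $2 = archive folder (e.g. oldspam), created if missing
#   $3 = limit, defaults to 12 ("say a dozen")
archive_if_full() {
    src=$1
    dst=$2
    limit=${3:-12}
    count=$(find "$src" -type f | wc -l | tr -d ' ')
    if [ "$count" -gt "$limit" ]; then
        mkdir -p "$dst"
        find "$src" -type f -exec mv {} "$dst"/ \;
    fi
}

# Example (hypothetical paths):
#   archive_if_full "$HOME/Maildir/.spam/cur" "$HOME/Maildir/.oldspam/cur"
```

Keeping the archive folders around is what makes it possible to rebuild the
database from scratch later.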

{^_^} 



Re: Which DB is actually used?

2006-09-08 Thread Logan Shaw

On Fri, 8 Sep 2006, Logan Shaw wrote:

Second, once you determine the correct user, in most cases
sa-learn should consult the same configuration file that
the learning process does, so there shouldn't be a reason to
give --dbpath.


Oops, that should have said "that the scanning process does".

  - Logan


Re: Which DB is actually used?

2006-09-08 Thread Logan Shaw

On Fri, 8 Sep 2006, Bo Mellberg wrote:
It seems like the exim-users database is being touched regularly, so I'm 
guessing that it has been set up by apt-get in some "auto-learning" state.


Yes, you might want to check whatever's running SpamAssassin and
see what user it's running as and also check the configuration
files (probably in /etc/mail/spamassassin) to see where it's
storing the database.
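
One quick way to narrow this down is to see which copy of bayes_toks was
written most recently and who owns it. The helper below is only a sketch:
the function name is made up, it relies on GNU stat as shipped with Debian,
and a recent modification time only shows that something wrote to that
database, not that the scanner reads from it.

```shell
#!/bin/sh
# Print "<owner> <directory>" for the candidate directory whose
# bayes_toks file has the newest modification time.
# Returns non-zero if none of the directories contains bayes_toks.
newest_bayes_dir() {
    newest_dir=""
    newest_mtime=0
    for dir in "$@"; do
        toks="$dir/bayes_toks"
        [ -f "$toks" ] || continue
        mtime=$(stat -c %Y "$toks")      # seconds since the epoch
        if [ "$mtime" -gt "$newest_mtime" ]; then
            newest_mtime=$mtime
            newest_dir=$dir
        fi
    done
    [ -n "$newest_dir" ] || return 1
    printf '%s %s\n' "$(stat -c %U "$newest_dir/bayes_toks")" "$newest_dir"
}

# Example, using the three directories from the original post:
#   newest_bayes_dir /root/.spamassassin /home/bosse/.spamassassin \
#       /var/spool/exim4/.spamassassin
```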

I have earlier trained spam and ham as user "bosse", which is why there is a 
working db there as well.


As I am the only user on my system, it really doesn't matter if I use 
site-wide or not, but rather how I invoke sa-learn.


Let's say I remove the databases for "bosse" and "root". Is this the proper 
way to invoke sa-learn:


1. Log on as user "bosse"
2. sa-learn --showdots --sync --dbpath /var/spool/exim4/.spamassassin --spam 
/home/bosse/Maildir/.MissedSpam/cur


Probably not, or at least not the best way.

First of all, you need to run sa-learn as the same user that
runs the filtering.  Since you haven't said what user that it
is (whether it's "bosse" or some other user), it's impossible
to say whether that's the correct user to run sa-learn as.

Second, once you determine the correct user, in most cases
sa-learn should consult the same configuration file that
the learning process does, so there shouldn't be a reason to
give --dbpath.

And finally, you don't really need to run --sync every time
you train the Bayes database, although I guess it wouldn't hurt.
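
Put together, the three points above might look like the sketch below. It
assumes the filtering user turns out to be Debian-exim (the owner of the
files under /var/spool/exim4/.spamassassin in the listing), that this user
is allowed to read the folder being learned, and that paths contain no
spaces. The function names and the DRY_RUN switch are hypothetical; by
default the helper only prints the command so it can be checked first.

```shell
#!/bin/sh
# User that runs the filtering; Debian-exim is an assumption.
FILTER_USER="${FILTER_USER:-Debian-exim}"

# Build the sa-learn command line: no --dbpath and no --sync.
#   $1 = "spam" or "ham", $2 = folder to learn from
learn_cmd() {
    printf 'sa-learn --showdots --%s %s' "$1" "$2"
}

# Run (or, with DRY_RUN=1, just print) the command as the filter user.
# -s /bin/sh is given because a system account may have no login shell.
run_as_filter_user() {
    cmd=$(learn_cmd "$1" "$2")
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf 'su -l -s /bin/sh -c "%s" %s\n' "$cmd" "$FILTER_USER"
    else
        su -l -s /bin/sh -c "$cmd" "$FILTER_USER"
    fi
}

# Example:
#   run_as_filter_user spam /home/bosse/Maildir/.MissedSpam/cur
#   DRY_RUN=0 run_as_filter_user spam /home/bosse/Maildir/.MissedSpam/cur
```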

If I set up a cron job to do the above I could just toss missed spam into the 
"MissedSpam"-folder right?


Yeah, but for efficiency reasons, you'd probably not want
messages in that folder to keep accumulating forever, so you'd
probably want a way to purge them after some period of time.
sa-learn can cope with a situation where you feed it the same
message repeatedly with no harm, but it's still a waste of
CPU cycles.
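
A purge step along those lines could be as small as the sketch below. The
function name and the 30-day default are assumptions; pick whatever
retention period suits the folder.

```shell
#!/bin/sh
# Delete message files older than a given number of days from a folder.
#   $1 = folder to clean, $2 = age in days (defaults to 30)
purge_old_mail() {
    dir=$1
    days=${2:-30}
    find "$dir" -type f -mtime +"$days" -exec rm -f {} \;
}

# Example, e.g. from the same cron job that runs sa-learn:
#   purge_old_mail /home/bosse/Maildir/.MissedSpam/cur 30
```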

  - Logan


Re: Which DB is actually used?

2006-09-08 Thread Bo Mellberg

jdow wrote:

From: "Bo Mellberg" <[EMAIL PROTECTED]>


I have SA 3.1.4 configured and running on Debian Sarge using apt-get.

I'm finding it hard to know what directory is actually used for the 
bayes-database:


max:~# ls /root/.spamassassin/ -al
total 2344
drwx--  2 root root4096 Sep  8 07:52 .
drwxr-xr-x 12 root root4096 Sep  5 09:37 ..
-rw---  1 root root   12288 Sep  4 14:20 auto-whitelist
-rw-rw-rw-  1 root root   6 Sep  4 14:20 auto-whitelist.mutex
-rw-rw-rw-  1 root root   13992 Sep  4 14:08 bayes.mutex
-rw---  1 root root  344064 Sep  4 14:05 bayes_seen
-rw---  1 root root 2605056 Sep  8 07:52 bayes_toks
-rw-r--r--  1 root root1487 Sep  4 14:20 user_prefs
max:~# ls /home/bosse/.spamassassin/ -al
total 4564
drwx--S--- 2 bosse bosse4096 Sep  7 10:35 .
drwxr-sr-x 5 bosse bosse4096 Aug 31 16:19 ..
-rw--- 1 root  bosse   12288 Sep  6 01:06 auto-whitelist
-rw--- 1 root  bosse   6 Sep  6 01:06 auto-whitelist.mutex
-rw-rw-rw- 1 bosse bosse   15282 Sep  6 01:06 bayes.mutex
-rw--- 1 root  bosse   86136 Sep  6 01:06 bayes_journal
-rw--- 1 bosse bosse  339968 Sep  6 01:06 bayes_seen
-rw--- 1 root  bosse 5255168 Sep  6 01:06 bayes_toks
-rw--- 1 root  bosse1165 Oct  2  2005 user_prefs
max:~# ls /var/spool/exim4/.spamassassin/ -al
total 3424
drwx-- 2 Debian-exim Debian-exim4096 Sep  8 08:04 .
drwxr-x--- 7 Debian-exim Debian-exim4096 Sep  5 15:54 ..
-rw--- 1 Debian-exim Debian-exim 1298432 Sep  8 08:04 auto-whitelist
-rw-rw-rw- 1 Debian-exim Debian-exim   6 Sep  4 14:15 auto-whitelist.mutex
-rw-rw-rw- 1 Debian-exim Debian-exim   6 Sep  4 14:15 bayes.mutex
-rw--- 1 Debian-exim Debian-exim   64704 Sep  8 08:04 bayes_journal
-rw--- 1 Debian-exim Debian-exim  319488 Sep  8 08:04 bayes_seen
-rw--- 1 Debian-exim Debian-exim 2629632 Sep  8 08:04 bayes_toks
-rw-r--r-- 1 Debian-exim Debian-exim1175 Nov  1  2005 user_prefs

As you can see there are three directories which are all quite 
recently changed. How can I make sure that only one directory is used?


I would like to make SA site-wide, but the filtering is working really 
well right now so I'm afraid I'll break something. BTW, the user 
"bosse" is my own account used for my email.


* I just performed sa-learn --sync -D as root.
* I've never touched the exim directory, still it has the latest 
change date.


Thanks in advance.

/Bo


Bo - I can't particularly help you with the single site-wide database
thing. It seems you have a bit of a mishmash that, depending on things
you have done, may actually be acting the way you want it to act. It
looks like you might have played with training or tests as "bosse"
and "root" and otherwise have everything working on the exim4 global
database. Always test and train as the user that is used for filtering
the email by the MTA. Other tests and training are meaningless.

If you do not have many users at all, dozens or less, then do
consider using per user BAYES. It CAN provide the users with a better
anti-spam experience. The reasoning behind this is that one user's
spam is almost always going to be some other user's ham. If you have
hundreds then there might be a good reason for a single BAYES database.
By the time you're into thousands you're using virtual accounts and
a global database may be required. But it won't provide quite the
pinpoint accuracy of a per user database.

{^_^}




Thanks for this info,

It seems like the exim-users database is being touched regularly, so I'm 
guessing that it has been set up by apt-get in some "auto-learning" state.


I have earlier trained spam and ham as user "bosse", which is why there 
is a working db there as well.


As I am the only user on my system, it really doesn't matter if I use 
site-wide or not, but rather how I invoke sa-learn.


Let's say I remove the databases for "bosse" and "root". Is this the 
proper way to invoke sa-learn:


1. Log on as user "bosse"
2. sa-learn --showdots --sync --dbpath /var/spool/exim4/.spamassassin 
--spam /home/bosse/Maildir/.MissedSpam/cur


If I set up a cron job to do the above I could just toss missed spam 
into the "MissedSpam"-folder right?


Thanks again!

/Bo


Re: Which DB is actually used?

2006-09-08 Thread jdow

From: "Bo Mellberg" <[EMAIL PROTECTED]>


I have SA 3.1.4 configured and running on Debian Sarge using apt-get.

I'm finding it hard to know what directory is actually used for the 
bayes-database:


max:~# ls /root/.spamassassin/ -al
total 2344
drwx--  2 root root4096 Sep  8 07:52 .
drwxr-xr-x 12 root root4096 Sep  5 09:37 ..
-rw---  1 root root   12288 Sep  4 14:20 auto-whitelist
-rw-rw-rw-  1 root root   6 Sep  4 14:20 auto-whitelist.mutex
-rw-rw-rw-  1 root root   13992 Sep  4 14:08 bayes.mutex
-rw---  1 root root  344064 Sep  4 14:05 bayes_seen
-rw---  1 root root 2605056 Sep  8 07:52 bayes_toks
-rw-r--r--  1 root root1487 Sep  4 14:20 user_prefs
max:~# ls /home/bosse/.spamassassin/ -al
total 4564
drwx--S--- 2 bosse bosse4096 Sep  7 10:35 .
drwxr-sr-x 5 bosse bosse4096 Aug 31 16:19 ..
-rw--- 1 root  bosse   12288 Sep  6 01:06 auto-whitelist
-rw--- 1 root  bosse   6 Sep  6 01:06 auto-whitelist.mutex
-rw-rw-rw- 1 bosse bosse   15282 Sep  6 01:06 bayes.mutex
-rw--- 1 root  bosse   86136 Sep  6 01:06 bayes_journal
-rw--- 1 bosse bosse  339968 Sep  6 01:06 bayes_seen
-rw--- 1 root  bosse 5255168 Sep  6 01:06 bayes_toks
-rw--- 1 root  bosse1165 Oct  2  2005 user_prefs
max:~# ls /var/spool/exim4/.spamassassin/ -al
total 3424
drwx-- 2 Debian-exim Debian-exim4096 Sep  8 08:04 .
drwxr-x--- 7 Debian-exim Debian-exim4096 Sep  5 15:54 ..
-rw--- 1 Debian-exim Debian-exim 1298432 Sep  8 08:04 auto-whitelist
-rw-rw-rw- 1 Debian-exim Debian-exim   6 Sep  4 14:15 auto-whitelist.mutex
-rw-rw-rw- 1 Debian-exim Debian-exim   6 Sep  4 14:15 bayes.mutex
-rw--- 1 Debian-exim Debian-exim   64704 Sep  8 08:04 bayes_journal
-rw--- 1 Debian-exim Debian-exim  319488 Sep  8 08:04 bayes_seen
-rw--- 1 Debian-exim Debian-exim 2629632 Sep  8 08:04 bayes_toks
-rw-r--r-- 1 Debian-exim Debian-exim1175 Nov  1  2005 user_prefs

As you can see there are three directories which are all quite recently 
changed. How can I make sure that only one directory is used?


I would like to make SA site-wide, but the filtering is working really 
well right now so I'm afraid I'll break something. BTW, the user "bosse" 
is my own account used for my email.


* I just performed sa-learn --sync -D as root.
* I've never touched the exim directory, still it has the latest change 
date.


Thanks in advance.

/Bo


Bo - I can't particularly help you with the single site-wide database
thing. It seems you have a bit of a mishmash that, depending on things
you have done, may actually be acting the way you want it to act. It
looks like you might have played with training or tests as "bosse"
and "root" and otherwise have everything working on the exim4 global
database. Always test and train as the user that is used for filtering
the email by the MTA. Other tests and training are meaningless.

If you do not have many users at all, dozens or less, then do
consider using per user BAYES. It CAN provide the users with a better
anti-spam experience. The reasoning behind this is that one user's
spam is almost always going to be some other user's ham. If you have
hundreds then there might be a good reason for a single BAYES database.
By the time you're into thousands you're using virtual accounts and
a global database may be required. But it won't provide quite the
pinpoint accuracy of a per user database.

{^_^}



Which DB is actually used?

2006-09-07 Thread Bo Mellberg

I have SA 3.1.4 configured and running on Debian Sarge using apt-get.

I'm finding it hard to know what directory is actually used for the 
bayes-database:


max:~# ls /root/.spamassassin/ -al
total 2344
drwx--  2 root root4096 Sep  8 07:52 .
drwxr-xr-x 12 root root4096 Sep  5 09:37 ..
-rw---  1 root root   12288 Sep  4 14:20 auto-whitelist
-rw-rw-rw-  1 root root   6 Sep  4 14:20 auto-whitelist.mutex
-rw-rw-rw-  1 root root   13992 Sep  4 14:08 bayes.mutex
-rw---  1 root root  344064 Sep  4 14:05 bayes_seen
-rw---  1 root root 2605056 Sep  8 07:52 bayes_toks
-rw-r--r--  1 root root1487 Sep  4 14:20 user_prefs
max:~# ls /home/bosse/.spamassassin/ -al
total 4564
drwx--S--- 2 bosse bosse4096 Sep  7 10:35 .
drwxr-sr-x 5 bosse bosse4096 Aug 31 16:19 ..
-rw--- 1 root  bosse   12288 Sep  6 01:06 auto-whitelist
-rw--- 1 root  bosse   6 Sep  6 01:06 auto-whitelist.mutex
-rw-rw-rw- 1 bosse bosse   15282 Sep  6 01:06 bayes.mutex
-rw--- 1 root  bosse   86136 Sep  6 01:06 bayes_journal
-rw--- 1 bosse bosse  339968 Sep  6 01:06 bayes_seen
-rw--- 1 root  bosse 5255168 Sep  6 01:06 bayes_toks
-rw--- 1 root  bosse1165 Oct  2  2005 user_prefs
max:~# ls /var/spool/exim4/.spamassassin/ -al
total 3424
drwx-- 2 Debian-exim Debian-exim4096 Sep  8 08:04 .
drwxr-x--- 7 Debian-exim Debian-exim4096 Sep  5 15:54 ..
-rw--- 1 Debian-exim Debian-exim 1298432 Sep  8 08:04 auto-whitelist
-rw-rw-rw- 1 Debian-exim Debian-exim   6 Sep  4 14:15 auto-whitelist.mutex
-rw-rw-rw- 1 Debian-exim Debian-exim   6 Sep  4 14:15 bayes.mutex
-rw--- 1 Debian-exim Debian-exim   64704 Sep  8 08:04 bayes_journal
-rw--- 1 Debian-exim Debian-exim  319488 Sep  8 08:04 bayes_seen
-rw--- 1 Debian-exim Debian-exim 2629632 Sep  8 08:04 bayes_toks
-rw-r--r-- 1 Debian-exim Debian-exim1175 Nov  1  2005 user_prefs

As you can see there are three directories which are all quite recently 
changed. How can I make sure that only one directory is used?


I would like to make SA site-wide, but the filtering is working really 
well right now so I'm afraid I'll break something. BTW, the user "bosse" 
is my own account used for my email.


* I just performed sa-learn --sync -D as root.
* I've never touched the exim directory, still it has the latest change 
date.


Thanks in advance.

/Bo