Re: Am I fscking up my bayes db?

2009-07-12 Thread Matus UHLAR - fantomas
On 09.07.09 09:30, Daniel Schaefer wrote:
> I have a similar setup. If a Spam message makes it to my inbox with less
> than the required_score, I put it into a SPAM folder and run sa-learn on
> the folder. Should I also implement the following ignore rules?
>
> bayes_ignore_header X-Spam-Flag
> bayes_ignore_header X-Spam-Level
> bayes_ignore_header X-Spam-Status
> bayes_ignore_header X-Spam...etc.

Not needed, these are already ignored by spamassassin itself.
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I just got lost in thought. It was unfamiliar territory. 


Re: Am I fscking up my bayes db?

2009-07-09 Thread Mike Cardwell

Steve Bertrand wrote:
> Hi everyone,
>
> I aggregate my work and personal email accounts within the same email
> client. All accounts are IMAP-based.
>
> My $work employs a Barracuda cluster, and of course my box runs SA.
>
> From time-to-time, I'll get a SPAM message come through the 'cuda's.
>
> From there, I move the message from one IMAP folder in my MUA into
> another SPAM folder, which essentially is a transfer from a work storage
> server onto my server.
>
> Every few days, I run sa-learn against the collected SPAM messages.
>
> My question is, given that the messages have already been processed by
> the 'cuda's (with their header stamps in place), am I damaging, or at
> risk of confusing the learning process of SA when I classify these
> messages as SPAM?
>
> Are there any negative consequences by doing this?


You should configure bayes to ignore those headers. In your local.cf, 
list each of the cuda headers like this:


bayes_ignore_header X-CudaHeader1
bayes_ignore_header X-CudaHeader2
bayes_ignore_header X-CudaHeader3
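
To find out which header names to list, one option is to scan the collected messages and print a ready-made line for each distinct header. The following is only a sketch under assumptions: the function name list_ignore_lines is made up for illustration, the prefix "X-Barracuda-" in the usage note is a guess that should be checked against real messages, and each file is assumed to hold a single message (Maildir style). The prefix match is case-sensitive.

```shell
#!/bin/sh
# Print one bayes_ignore_header line for every distinct header whose name
# starts with the given prefix. Only the header section of each file is read.
# list_ignore_lines is a hypothetical helper, not part of SpamAssassin.
list_ignore_lines() {
    prefix=$1
    shift
    awk -v prefix="$prefix" '
        FNR == 1 { in_headers = 1 }          # new file: back in the headers
        /^\r?$/  { in_headers = 0 }          # first blank line ends the headers
        in_headers && index($0, prefix) == 1 {
            name = $0
            sub(/:.*/, "", name)             # keep only the header name
            if (!(name in seen)) {
                seen[name] = 1
                print "bayes_ignore_header " name
            }
        }
    ' "$@"
}
```

Called as `list_ignore_lines X-Barracuda- /path/to/spam/folder/*`, it prints lines that can be reviewed and then pasted into local.cf.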

--
Mike Cardwell - IT Consultant and LAMP developer
Cardwell IT Ltd. (UK Reg'd Company #06920226) http://cardwellit.com/


Re: Am I fscking up my bayes db?

2009-07-09 Thread Daniel Schaefer

Mike Cardwell wrote:
> Steve Bertrand wrote:
>> Hi everyone,
>>
>> I aggregate my work and personal email accounts within the same email
>> client. All accounts are IMAP-based.
>>
>> My $work employs a Barracuda cluster, and of course my box runs SA.
>>
>> From time-to-time, I'll get a SPAM message come through the 'cuda's.
>>
>> From there, I move the message from one IMAP folder in my MUA into
>> another SPAM folder, which essentially is a transfer from a work storage
>> server onto my server.
>>
>> Every few days, I run sa-learn against the collected SPAM messages.
>>
>> My question is, given that the messages have already been processed by
>> the 'cuda's (with their header stamps in place), am I damaging, or at
>> risk of confusing the learning process of SA when I classify these
>> messages as SPAM?
>>
>> Are there any negative consequences by doing this?
>
> You should configure bayes to ignore those headers. In your local.cf,
> list each of the cuda headers like this:
>
> bayes_ignore_header X-CudaHeader1
> bayes_ignore_header X-CudaHeader2
> bayes_ignore_header X-CudaHeader3

I have a similar setup. If a Spam message makes it to my inbox with less 
than the required_score, I put it into a SPAM folder and run sa-learn on 
the folder. Should I also implement the following ignore rules?


bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Level
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam...etc.

--
Dan Schaefer



Re: Am I fscking up my bayes db?

2009-07-09 Thread Steve Bertrand
Mike Cardwell wrote:
> Steve Bertrand wrote:
>
>> My question is, given that the messages have already been processed by
>> the 'cuda's (with their header stamps in place), am I damaging, or at
>> risk of confusing the learning process of SA when I classify these
>> messages as SPAM?
>>
>> Are there any negative consequences by doing this?
>
> You should configure bayes to ignore those headers. In your local.cf,
> list each of the cuda headers like this:
>
> bayes_ignore_header X-CudaHeader1
> bayes_ignore_header X-CudaHeader2
> bayes_ignore_header X-CudaHeader3

Thanks Mike.

I very rarely have to touch my email setup, but
I've always been curious about this.

Given your recommendation, would you say that a reset on the db should
be performed?

Essentially, is it fair to say that what I've done has possibly caused
damage?

Steve

ps. fwiw, I feel that my SA setup is not under-performing in any way at
this time.




Re: Am I fscking up my bayes db?

2009-07-09 Thread Martin Gregorie
On Thu, 2009-07-09 at 08:50 -0400, Steve Bertrand wrote:
> My question is, given that the messages have already been processed by
> the 'cuda's (with their header stamps in place), am I damaging, or at
> risk of confusing the learning process of SA when I classify these
> messages as SPAM?

Not really answering your question, but I find it's helpful to strip SA
headers out of the message collection I use for testing private rules.
Here's a simple bash shell script fragment that does the job and does it
fairly fast:


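for f in data/*.txt
do
    echo "Cleaning $f"
    gawk '
    BEGIN       { act = "copy" }   # start out copying lines
    /^X-Spam/   { act = "skip" }   # an X-Spam header starts a skipped run
    /^[A-WYZ]/  { act = "copy" }   # a line starting with another capital ends it
    {
        if (act == "copy")
            { print }
    }
    ' "$f" > temp.txt
    mv temp.txt "$f"
done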



Martin




Re: Am I fscking up my bayes db?

2009-07-09 Thread Chr. von Stuckrad
On Thu, 09 Jul 2009, Martin Gregorie wrote:

> Here's a simple bash shell script fragment that does the job and does it
> fairly fast:
>
> for f in data/*.txt
...
> gawk '
...
> done

With non-Linux users also on the list, you might have explained
that THIS script needs 'gawk' (old awk would be enough) and
works on all the files in one directory whose names
end in '.txt' :-) E.g. my mail collection files mostly end in
'*.box' or '*.eml', and my old Solaris never had any 'gawk'.
The trick of deleting all runs of 'X' headers from 'X-Spam' on
is a good idea (except e.g. if the next header is 'X-remote-IP'
and you want to check for internal mail :-).
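
A variant along those lines might look like the sketch below. It is not Martin's script: the function name strip_spam_headers is made up for illustration, it relies only on plain awk, it takes whatever file names it is given, and it assumes one message per file. It drops only headers whose names begin with X-Spam- together with their folded continuation lines (lines starting with a space or tab), so a following header such as X-remote-IP is kept, and the message body is never touched.

```shell
#!/bin/sh
# Remove X-Spam-* headers (and their folded continuation lines) from the
# header section of each file named on the command line.
# strip_spam_headers is a hypothetical helper name.
strip_spam_headers() {
    for f in "$@"
    do
        tmp="$f.tmp.$$"
        awk '
            in_body   { print; next }                  # body: copy untouched
            /^\r?$/   { in_body = 1; print; next }     # blank line ends the headers
            /^[ \t]/  { if (!skip) print; next }       # folded line follows its header
            /^[Xx]-[Ss][Pp][Aa][Mm]-/ { skip = 1; next }   # drop an X-Spam-* header
                      { skip = 0; print }              # any other header is kept
        ' "$f" > "$tmp" && mv "$tmp" "$f"
    done
}
```

Usage would be something like `strip_spam_headers data/*.eml`. Note that the rewritten file is a new file, so ownership and permissions may differ from the original.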

Stucki
-- 
Christoph von Stuckrad  * * |nickname |Mail stu...@mi.fu-berlin.de \
Freie Universitaet Berlin   |/_*|'stucki' |Tel(Mo.,Do.):+49 30 838-75 459|
Mathematik  Informatik EDV |\ *|if online|  (Di,Mi,Fr):+49 30 77 39 6600|
Takustr. 9 / 14195 Berlin   * * |on IRCnet|Fax(home):   +49 30 77 39 6601/


Re: Am I fscking up my bayes db?

2009-07-09 Thread John Hardin

On Thu, 9 Jul 2009, Martin Gregorie wrote:
> On Thu, 2009-07-09 at 08:50 -0400, Steve Bertrand wrote:
>> My question is, given that the messages have already been processed by
>> the 'cuda's (with their header stamps in place), am I damaging, or at
>> risk of confusing the learning process of SA when I classify these
>> messages as SPAM?
>
> Not really answering your question, but I find it's helpful to strip SA
> headers out of the message collection I use for testing private rules.
> Here's a simple bash shell script fragment that does the job and does it
> fairly fast:
>
> for f in data/*.txt
> do
>    echo "Cleaning $f"
>    gawk '
>    BEGIN       { act = "copy" }
>    /^X-Spam/   { act = "skip" }
>    /^[A-WYZ]/  { act = "copy" }
>    {
>      if (act == "copy")
>        { print }
>    }
>    ' "$f" > temp.txt
>    mv temp.txt "$f"
> done



...wouldn't that mangle wrapped X-Spam headers?

--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  North Korea: the only country in the world where people would risk
  execution to flee to communist China.  -- Ride Fast
---
 11 days until the 40th anniversary of Apollo 11 landing on the Moon


Re: Am I fscking up my bayes db?

2009-07-09 Thread RW
On Thu, 09 Jul 2009 09:30:37 -0400
Steve Bertrand st...@ibctech.ca wrote:


> I very rarely have to touch my email setup,
> but I've always been curious about this.
>
> Given your recommendation, would you say that a reset on the db should
> be performed?
>
> Essentially, is it fair to say that what I've done has possibly caused
> damage?

The Barracuda headers don't matter much unless you get similar headers
in your legitimate incoming mail, in which case just tell bayes to
ignore them. The irrelevant tokens will eventually age out of the
database.

The Received headers are a bit more of a problem because you're
weighting bayes against your work domain, IP addresses etc. You could
try sending yourself a mail from work and see if it looks spammy.
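
One way to read the result of such a test mail is to look at which BAYES_nn rule SA recorded for it. The sketch below assumes the receiving SA adds an X-Spam-Status header containing a tests= list, and that the delivered message has been saved to a file of its own; the function name bayes_rule_of is made up for illustration.

```shell
#!/bin/sh
# Print the BAYES_nn rule name recorded in the X-Spam-Status header of a
# saved message, or nothing if no Bayes rule is listed there.
# bayes_rule_of is a hypothetical helper name.
bayes_rule_of() {
    awk '
        /^\r?$/            { exit }                              # stop at the end of the headers
        /^[ \t]/           { if (in_status) buf = buf $0; next } # folded continuation line
        /^X-Spam-Status:/  { in_status = 1; buf = $0; next }
                           { in_status = 0 }                     # some other header
        END {
            if (match(buf, /BAYES_[0-9]+/))
                print substr(buf, RSTART, RLENGTH)
        }
    ' "$1"
}
```

A low number such as BAYES_00 means bayes leans towards ham, BAYES_50 is neutral, and a high number such as BAYES_99 means the mail from work does look spammy to bayes.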