Re: About reporting

2009-09-21 Thread Benny Pedersen

On Tue 22 Sep 2009 03:05:30 CEST, João Eiras wrote:

> Goodbye.

Don't you want to be here anymore?

--
xpoint



Re: Re-running SA on an mbox

2009-09-21 Thread Benny Pedersen

On Mon 21 Sep 2009 20:33:57 CEST, MySQL Student wrote:

>> But this will invalidate DKIM signatures if these headers are
>> signed. Is SpamAssassin aware of this problem (in general)?
>
> Are you saying there is a bug?

Partly, yes. It's not a bug as long as you keep the original email,
but "spamassassin --mbox < infile > outfile" invalidates DKIM-signed
mails, doesn't it?

>> mutt -f mbox
>> In mutt, save to another folder if misclassified.
>
> Yes, I use pine for that, but would like to eliminate as many of
> the FNs as possible, particularly ones that I can't determine visually.

Can pine sort mails by header contents?

If yes, it will be less manual work for you.

--
xpoint



Re: Re-running SA on an mbox

2009-09-21 Thread MySQL Student
Hi,

It's certainly not a fast operation, but using the following will
split an mbox into individual messages:

export FILENO=0
mkdir msgs
formail -s sh -c 'cat - >msgs/$FILENO' < mbox-name.mbox

I also created a loop that would strip all the SA headers from the messages:

for file in *; do echo "Processing: $file"; spamassassin -d < "$file" > "$file.txt"; done

This worked for a few hundred of the messages, but then started to
fail on my production system with:

[22135] warn: bayes: cannot open bayes databases
/home/user/.spamassassin/bayes_* R/W: lock failed: File exists

How can I tell when another process is using the database and when it
is free for my script to use?

Is there a faster way to run spamassassin just to strip the SA headers?

Maybe there is a faster way, like passing the messages through the
running amavisd instead of having to restart spamassassin each time to
re-process each message?

Thanks,
Alex


Re: About reporting

2009-09-21 Thread João Eiras



>> On 19.09.09 22:45, João Eiras wrote:
>>> I still haven't got the answer to my last question, so here it goes again:
>>> Can I report a full mbs file with many mails in one go? Or should I split
>>> each mail into its own file?
>>
>> sa-learn can accept mail in mbox format. It's even in its manual page.
>> Is this what you meant?
>
> Might be. I'm not familiar with SpamAssassin's internals; I'm just a normal
> e-mail user wanting to report some spam.
> A quick "man sa-learn" shows:
>
>    --mbox
>    sa-learn will read in the file(s) containing the emails to be
>    learned, and will process them in mbox format (one or more emails per file).
>
> So, I have my answer.
> The page at http://wiki.apache.org/spamassassin/ReportingSpam is ambiguous in
> this regard. It just mentions a message.txt.
> I hope you can make something out of my uploaded mbox file :)
>
> Thank you

After further testing: while sa-learn might support multiple emails in a single
mbox file, "spamassassin -r" does not, so I cooked up a Perl script to split the
mbox file into individual mails, and then a simple loop is enough to report
everything.
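
A minimal sketch of that split-then-report approach, in shell rather than
Perl (the Perl script itself was not posted). It assumes that every line
starting with "From " begins a new message, which only holds for mbox files
whose body lines are escaped as ">From ". The function names and the
msg.NNNN file naming are made up for illustration.

```shell
# split_mbox MBOX OUTDIR
# Write each message of MBOX to OUTDIR/msg.0001, OUTDIR/msg.0002, ...
# Assumption: a line beginning with "From " always starts a new message.
split_mbox() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        /^From / {
            if (out) close(out)                     # finish the previous message
            out = sprintf("%s/msg.%04d", dir, ++n)  # start the next output file
        }
        out { print > out }                         # skip anything before the first "From "
    ' "$1"
}

# report_each DIR [COMMAND...]
# Feed every split message to COMMAND on stdin, one message per run.
# COMMAND defaults to "spamassassin -r"; stops at the first failure.
report_each() {
    dir=$1
    shift
    [ "$#" -gt 0 ] || set -- spamassassin -r
    for f in "$dir"/msg.*; do
        [ -f "$f" ] || continue
        "$@" < "$f" || return 1
    done
}
```

Usage would look like "split_mbox spam.mbox msgs && report_each msgs".
Passing another command after the directory (for example "report_each msgs
grep -c '^Subject:'") is a way to dry-run the loop before anything is
actually reported.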

Goodbye.


Re: Problems with high spam

2009-09-21 Thread Aaron Wolfe
On Mon, Sep 21, 2009 at 11:34 AM, Martin Gregorie  wrote:
> On Mon, 2009-09-21 at 09:58 -0500, Jose Luis Marin Perez wrote:
>
>> I will implement improvements in the configuration  suggested and
>> observe the results, however, that more could be suggested to improve
>> my spam service?
>>
> I think you need to find out more about where your system resources are
> going.
>
> For starters, take a look at maillog (/var/log/maillog on my system) to
> check whether any SA child processes are timing out. If they are, you
> need to find out why processing those messages took so long and, if
> possible, speed that up, e.g. if RBL checks or domain name lookups are
> slow, consider running a local caching DNS.
>
> If that doesn't turn up anything obvious, use performance monitoring
> tools (sar, iostat, mpstat, etc) to see what is consuming the system
> resources: you have to know where and what the bottleneck(s) are before
> you can do anything about them. You can find these tools here:
>
> http://freshmeat.net/projects/sysstat/
>
> if they aren't part of your distro's package repository.
>
>
> Martin
>
>
>

Has there been any evidence that the OP's system is short on
resources?  If so I missed it.
The complaint was that too much spam is making it past the filter,
with a detection rate of only 54%.
This is not a very good percentage for a typical mail flow (if it is
actually accurate, i.e. not missing the mails rejected by RBLs or
RFC/syntax checks).

There were several issues with the configuration that kind people on
the list have pointed out.  Assuming these suggested changes have been
implemented, what is the detection rate now?

From the posted local.cf, it is evident that the SA configuration is
not working very well.  There are many manually entered whitelist
rules, and also many manually added rules that score 100.  This is a
telltale sign of a very bad setup that is attempting to band-aid the
symptoms instead of fixing the core issue.  And as pointed out before,
both the whitelist and the subject match -> 100 are very bad ideas.
Whitelisting the sender is easily taken advantage of by spammers,
and those +100 pt matches are sure to generate FPs.  Using rules this
way demonstrates a lack of understanding of the way that SA is supposed
to work.  SA rules rarely attempt to kill a message in one shot (100
pts); instead they add or subtract a small amount from the score based
on the likelihood that a match means spam or ham.  Fine tuning, not
smashing with a hammer.
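
A small sketch of that additive model: each matching rule contributes a
modest score, and only the total is compared with the threshold (SA's
default required_score is 5.0). The rule names and scores in the usage line
are invented for illustration and do not come from any real ruleset.

```shell
# sum_scores
# Read "RULE_NAME score" pairs on stdin and print the total score.
sum_scores() {
    awk '{ total += $2 } END { printf "%.1f\n", total }'
}

# is_spam TOTAL [THRESHOLD]
# Succeed (exit 0) when TOTAL reaches THRESHOLD; THRESHOLD defaults to 5.0.
is_spam() {
    awk -v t="$1" -v r="${2:-5.0}" 'BEGIN { exit !(t + 0 >= r + 0) }'
}
```

For example, "printf 'LOCAL_A 1.5\nLOCAL_B 2.2\nLOCAL_C -0.5\n' | sum_scores"
adds three hypothetical hits, and the result can be handed to is_spam. No
single hit decides the outcome, which is the point of keeping individual
scores small.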

So, I think it is pretty safe to assume that the problem lies within
the SA configuration.

Maybe there are old rulesets that need to be updated.  Maybe not a
good selection of rulesets in the first place.  Perhaps this is an
"out of the box" configuration that has never been properly set up.

There are many good guides to setting up SA and supporting services
available online.  If the OP were to follow one of them to the letter,
I think the detection rate would be much improved.  Also some time
spent learning more about SA in general would allow the OP to fine
tune his config so that the current manual effort put into creating
hammer smashing rules is unneeded.

Good luck
-Aaron


Re: About reporting

2009-09-21 Thread João Eiras

On , Matus UHLAR - fantomas wrote:

>> On , Theo Van Dinter wrote:
>>
>>> On Sun, Sep 13, 2009 at 5:08 PM, João Eiras wrote:
>>>> Should the file message.txt in the example contain the full e-mail with
>>>> headers, attachments and everything?
>>>
>>> Yes.  It should be the original and complete message.
>>>
>>>> Does the reporting tool remove all information about the receiver for
>>>> privacy's sake?
>>>
>>> No, nothing is removed from the message.
>
> On 19.09.09 22:45, João Eiras wrote:
>> I still haven't got the answer to my last question, so here it goes again:
>> Can I report a full mbs file with many mails in one go? Or should I split
>> each mail into its own file?
>
> sa-learn can accept mail in mbox format. It's even in its manual page.
> Is this what you meant?

Might be. I'm not familiar with SpamAssassin's internals; I'm just a normal
e-mail user wanting to report some spam.
A quick "man sa-learn" shows:

   --mbox
   sa-learn will read in the file(s) containing the emails to be
   learned, and will process them in mbox format (one or more emails per file).

So, I have my answer.
The page at http://wiki.apache.org/spamassassin/ReportingSpam is ambiguous in
this regard. It just mentions a message.txt.
I hope you can make something out of my uploaded mbox file :)

Thank you





Re: Re-running SA on an mbox

2009-09-21 Thread MySQL Student
Hi,

> IIRC you previously mentioned using Pine. Just in case you're not aware
> the default format for Pine/Alpine is MBX, an extended version of
> MBOX. You can tell the difference because MBX mailboxes start with a
> dummy email that's hidden by the software.

It seems that if you save messages into a separate folder it does not
add the DUMMY information at the top. I believe this is why the system
was set up to use "mbox" and not "mbx". Does this sound correct?

> I'd be very wary about allowing any tool to modify an MBX file unless
> you know it's safe. Where locking is an issue, Mark Crispin recommends
> that they only be accessed via the c-client library.

This isn't the actual spool file, but a copy in the home directory.

Thanks,
Alex


Understanding SpamAssassin

2009-09-21 Thread poifgh

I am trying to understand the inner workings of SpamAssassin, and it would be
great if someone could answer my questions. I have read the online
documentation, but some questions are still unanswered or I am unsure about
the answers.

As far as I understand, the default configuration of spamassassin processes
emails in this fashion

DNSBL Tests ---> RAW Body Tests ---> Bayesian Learning --> AWL

[Is the sequence right? I know for sure AWL comes in last, what about
Bayesian learning and RAW Body tests' order? Did I miss any module?]

Why do we need Bayesian learning in presence of RAW body tests?

Mails which have a very high or very low score are fed to Bayesian learning.
Since we are already confident about them being ham or spam, what do we want
to learn from them? The regex filters have identified that the mail is spam
(say), so what more does Bayesian learning achieve? Does it learn other words
in the spam mail (say, words surrounding an obfuscated term) in the hope of
matching them in future emails? Or am I misunderstanding it completely?

Thanks for the help.



Re: Eliminating russian spam

2009-09-21 Thread John Hardin

On Tue, 22 Sep 2009, Makoev Alan wrote:

> I've written a brief "how-to" for blocking e-mail in Russian. It's
> intended for those who are confident that any message in Russian sent to
> them is nothing but spam. See it here:
> http://sa-russian.narod.ru/no_russian.html I'd like to hear SA experts'
> opinions and advice.


  However, the message can be a MIME "multipart" one with charset
  declarations preceding the parts within the body, so this should be
  "full message" rule:

Not true, and bad advice (at least from a performance standpoint). Take a
look at the mimeheader plugin and avoid using "full" rules.
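
A hedged sketch of what a mimeheader-based rule could look like, wrapped in
a shell function that writes it to a rules file. The rule name, the regex,
the score and the function name are all hypothetical; the only parts taken
from SpamAssassin itself are the "mimeheader NAME Header =~ /regex/" rule
syntax of the MIMEHeader plugin and the usual describe/score lines. It
assumes the plugin is loaded on your installation.

```shell
# write_charset_rule FILE
# Write a hypothetical local rule file that matches Russian charsets in the
# Content-Type header of individual MIME parts, instead of a "full" rule.
write_charset_rule() {
    cat > "$1" <<'EOF'
# Hypothetical local rule: name, regex and score are illustrative only.
# Requires Mail::SpamAssassin::Plugin::MIMEHeader to be loaded.
mimeheader LOCAL_RU_CHARSET  Content-Type =~ /charset\s*=\s*"?(?:koi8-r|windows-1251)/i
describe   LOCAL_RU_CHARSET  MIME part declares a Russian charset
score      LOCAL_RU_CHARSET  2.0
EOF
}
```

After writing the file into whatever directory holds your site rules,
"spamassassin --lint" will report syntax problems before the rule goes live.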


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Ignorance doesn't make stuff not exist.   -- Bucky Katt
---
 19 days since a sunspot last seen - EPA blames CO2 emissions


Re: Understanding SpamAssassin

2009-09-21 Thread Bowie Bailey
poifgh wrote:
> I am trying to understand inner workings of spam assassin and would be great
> if someone can answer my questions. I have read online documentation but
> there are still some questions left unanswered or I am not sure about.
>   

I'm not an expert, just a long-time user, but I can give you some basic
answers.

> As far as I understand, the default configuration of spamassassin processes
> emails in this fashion
>
> DNSBL Tests ---> RAW Body Tests ---> Bayesian Learning --> AWL 
>
> [Is the sequence right? I know for sure AWL comes in last, what about
> Bayesian learning and RAW Body tests' order? Did I miss any module?]
>   

As I understand it, quite a bit of this is done in parallel.  In
particular, the DNS based tests are fired off first and then other tests
are run while waiting for the response.

In any case, unless you are playing with the shortcut features, all
rules are run for every message, so does it really matter what order
they are in?

> Why do we need Bayesian learning in presence of RAW body tests?
>
> Mails which have very high or very low score are fed to bayesian learning.
> Since we are confident about them being HAM or SPAM what do we want to learn
> from them - The regex filters have identified that the mail is a spam (say),
> what additional does bayesian learning achieve? Does it learn other words in
> the spam mail (say words surrounding obfuscated term) in hope of matching
> them in future emails? Or am I understanding it completely different?
>   

For auto-learning, the high and low scoring messages are fed to Bayes. 
However, for an optimal setup, you should manually train Bayes on as
much of your (verified) ham and spam as possible.  The more of your mail
stream Bayes sees, the better the results will be.

Your description of Bayes is pretty close.  It breaks down the message
into "tokens" (words and character sequences) and then keeps track of
how likely each of those tokens is to appear in either a ham or spam
message.  When a new message comes in, Bayes breaks it into tokens and
then scores it depending on which tokens were found in the message.
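
A toy illustration of that token idea, not SpamAssassin's actual Bayes
code: it counts tokens in a spam corpus and a ham corpus, gives each token
seen in training a smoothed spam probability of (spam + 1) / (spam + ham + 2),
and averages those over the tokens of a new message. The tokenizer, the
smoothing and the plain averaging are all simplifying assumptions, and the
function names are made up.

```shell
# tokens
# Split stdin into lowercase alphanumeric tokens, one per line.
tokens() {
    tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d'
}

# toy_bayes_score SPAM_FILE HAM_FILE MESSAGE_FILE
# Print the mean per-token spam probability (0.00 to 1.00) of MESSAGE_FILE.
# Tokens never seen in training are ignored; prints 0.50 if none are known.
toy_bayes_score() {
    spam_tok=$(mktemp)
    ham_tok=$(mktemp)
    msg_tok=$(mktemp)
    tokens < "$1" > "$spam_tok"
    tokens < "$2" > "$ham_tok"
    tokens < "$3" | sort -u > "$msg_tok"   # each message token counts once
    awk '
        FILENAME == ARGV[1] { spam[$1]++; next }   # training: spam counts
        FILENAME == ARGV[2] { ham[$1]++; next }    # training: ham counts
        {
            s = spam[$1] + 0
            h = ham[$1] + 0
            if (s + h == 0) next                   # unknown token
            sum += (s + 1) / (s + h + 2)           # smoothed spam probability
            n++
        }
        END { if (n) printf "%.2f\n", sum / n; else print "0.50" }
    ' "$spam_tok" "$ham_tok" "$msg_tok"
    rm -f "$spam_tok" "$ham_tok" "$msg_tok"
}
```

A value near 1 means the message's tokens were mostly seen in spam, and a
value near 0 means mostly ham. This also shows why training on more of your
own mail helps: tokens that were never trained contribute nothing.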

-- 
Bowie.


Re: Re-running SA on an mbox

2009-09-21 Thread MySQL Student
> but this will invalidtate dkim headers if this headers is signed, are
> spamassassin aware of this problem ? (in general)

Are you saying there is a bug?

> mutt -f mbox
>
> in mutt save to another folder if missclassified

Yes, I use pine for that, but would like to eliminate as many of the
FNs as possible, particularly ones that I can't determine visually.

Thanks,
Dave


Re: Re-running SA on an mbox

2009-09-21 Thread MySQL Student
Hi,

>> Thank you all for your help. The "mbox split" suggestion is a good
>> one. I'll follow that route and post my experience later.
>
> formail -s is the way to go.

I thought about that as a component of procmail. Sounds great.

Thanks,
Alex


Re: Re-running SA on an mbox

2009-09-21 Thread RW
On Sun, 20 Sep 2009 21:15:14 -0400
MySQL Student  wrote:

> Hi,
> 
> I have an mbox with about a 100 messages in it from a few days ago.
> The mbox is a combination of spam and ham. What is the best way to run
> SA through these messages again, so I can catch the ones that have
> URLs in them that weren't on the blacklist at the time they were
> received?

IIRC you previously mentioned using Pine. Just in case you're not aware
the default format for Pine/Alpine is MBX, an extended version of
MBOX. You can tell the difference because MBX mailboxes start with a
dummy email that's hidden by the software. 

I'd be very wary about allowing any tool to modify an MBX file unless
you know it's safe. Where locking is an issue, Mark Crispin recommends
that they only be accessed via the c-client library.



RE: Problems with high spam

2009-09-21 Thread Martin Gregorie
On Mon, 2009-09-21 at 09:58 -0500, Jose Luis Marin Perez wrote:

> I will implement improvements in the configuration  suggested and
> observe the results, however, that more could be suggested to improve
> my spam service? 
>
I think you need to find out more about where your system resources are
going. 

For starters, take a look at maillog (/var/log/maillog on my system) to
check whether any SA child processes are timing out. If they are, you
need to find out why processing those messages took so long and, if
possible, speed that up, e.g. if RBL checks or domain name lookups are
slow, consider running a local caching DNS.

If that doesn't turn up anything obvious, use performance monitoring
tools (sar, iostat, mpstat, etc) to see what is consuming the system
resources: you have to know where and what the bottleneck(s) are before
you can do anything about them. You can find these tools here:
 
http://freshmeat.net/projects/sysstat/

if they aren't part of your distro's package repository.


Martin




RE: Problems with high spam

2009-09-21 Thread Jose Luis Marin Perez

Dear Sirs,

I appreciate your help.

So the problem would not be the low RAM?

I will implement the suggested configuration improvements and observe
the results. However, what more could be suggested to improve my spam
service?

 This is my current memory usage:

             total       used       free     shared    buffers     cached
Mem:           501        284        216          0         24         41
-/+ buffers/cache:        218        282
Swap:         1027         59        968

 Thanks for your time and support.

Jose Luis

> Subject: Re: Problems with high spam
> From: guent...@rudersport.de
> To: users@spamassassin.apache.org
> Date: Sat, 19 Sep 2009 18:15:14 +0200
> 
> On Sat, 2009-09-19 at 02:23 -0400, Aaron Wolfe wrote:
> > 2009/9/18 Karsten Bräckelmann:
> > > This machine NEEDS more RAM. In fact, I'd guess half of the spam
> > > slipping through is due to timeouts. Thrashing into hell.
> > 
> > throwing ram at a server is not a solution in this case.  512MB is
> > sufficient to handle this mail load, as indicated by his post showing
> > little swap utilization on the system and confirmed by my real world
> 
> You're right, Aaron, the output of 'free' suggests this is not actually
> a problem.
> 
> Alas, even though I asked repeatedly, this data point was given after
> that post of mine, and I was limited to very little info and some
> observations.
> 
> > experience. here we handle over 1 million messages per day per node,
> > each node has 1GB ram.   ram required is easily calculated by base
> > services + SA instance usage X number of instances you'd like to use.
> > having less instances generally just means slight (very slight in most
> > cases) delays.  having more instances than your ram can contain means
> > big delays.   properly configured server will not start swapping and
> > falling over when a flood of mail comes in, mail simply spends more
> > time in queue.  the difference between 1 second and 1 minute in queue
> > is not usually significant to users.
> > 
> > the problem here is bad administration.  hopefully with the advice
> > given on list and better yet some time spent studying docs, this can
> > be corrected.
> 
> -- 
> char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
> 
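
The sizing rule quoted above (RAM required = base services + per-instance
SA usage x number of instances) can be turned around to estimate how many
instances fit on a given machine. A minimal sketch follows; the function
name is made up, and any figures you feed it should be measured on your own
system rather than taken from this thread.

```shell
# max_instances TOTAL_MB BASE_MB PER_INSTANCE_MB
# Print how many SA instances fit in RAM once base services are accounted
# for. All arguments are whole megabytes; prints 0 when nothing fits.
max_instances() {
    avail=$(( $1 - $2 ))
    if [ "$avail" -le 0 ] || [ "$3" -le 0 ]; then
        echo 0
        return
    fi
    echo $(( avail / $3 ))
}
```

With hypothetical figures of 512 MB total, 200 MB for base services and
50 MB per instance, "max_instances 512 200 50" gives the ceiling for the
number of child processes to configure. Staying at or below that ceiling
means a mail flood lengthens the queue instead of pushing the box into swap.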
  

Re: About reporting

2009-09-21 Thread Matus UHLAR - fantomas
> On , Theo Van Dinter  wrote:
>
>> On Sun, Sep 13, 2009 at 5:08 PM, João Eiras  wrote:
>>> Should the file message.txt in the example contain the full -mail with
>>> headers, attachments and everything ?
>>
>> Yes.  It should be the original and complete message.
>>
>>> Does the reporting tool remove all information about the receiver for
>>> privacy sake ?
>>
>> No, nothing is removed from the message.

On 19.09.09 22:45, João Eiras wrote:
> I still haven't got the answer to my last question, so here it goes again:
> Can I report a full mbs file with many mails in one go ? Or should I split 
> each mail on it's own file ?

sa-learn can accept mail in mbox format. It's even in its manual page.
Is this what you meant?
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Eagles may soar, but weasels don't get sucked into jet engines. 


Re: Problems with high spam

2009-09-21 Thread Matus UHLAR - fantomas
> > also if using amavisd make its temp dir on ram speed up scanning and it
> > considered safe, mta have it on disk for the backup :)

On 19.09.09 00:56, MySQL Student wrote:
> How about mounting /var with noatime? Does anyone do that? Do you
> think it helps? What Linux filesystem is best suited for this? ext4?

Only for huge filesystems with many files, e.g. proxy caches, mail queues,
news spools, etc.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Depression is merely anger without enthusiasm.