Re: About reporting
On Tue 22 Sep 2009 03:05:30 CEST, João Eiras wrote:

> Goodbye.

Don't want to be here anymore?

--
xpoint
Re: Re-running SA on an mbox
On Mon 21 Sep 2009 20:33:57 CEST, MySQL Student wrote:

>> but this will invalidate DKIM headers if the headers are signed; is
>> spamassassin aware of this problem? (in general)
> Are you saying there is a bug?

Partly, yes. It's not a bug as long as you keep the original email, but

    spamassassin --mbox < infile > outfile

invalidates DKIM-signed mails, no?

>> mutt -f mbox
>>
>> in mutt, save to another folder if misclassified
> Yes, I use pine for that, but would like to eliminate as many of the FNs
> as possible, particularly ones that I can't determine visually.

Can pine sort mails by header contents? If yes, it will be less manual work for you.

--
xpoint
Re: Re-running SA on an mbox
Hi,

It's certainly not a fast operation, but the following will split an mbox into individual messages:

    export FILENO=0
    mkdir msgs
    formail -s sh -c 'cat - > "msgs/$FILENO"' < mbox-name.mbox

I also created a loop that strips all the SA headers from the messages:

    for file in *; do
        echo "Processing: $file"
        spamassassin -d < "$file" > "$file.txt"
    done

This worked for a few hundred of the messages, but then started to fail on my production system with:

    [22135] warn: bayes: cannot open bayes databases /home/user/.spamassassin/bayes_* R/W: lock failed: File exists

How can I tell when another process is using the database and when it is free for my script to use?

Is there a faster way to run spamassassin just to strip the SA headers? Maybe passing the messages through the running amavisd, instead of having to restart spamassassin to re-process each message?

Thanks,
Alex
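On the "faster way to strip the SA headers" question, one possible approach is to skip spamassassin for that step and remove the headers directly in a single process. The sketch below is not equivalent to `spamassassin -d`: it assumes SA only added `X-Spam-*` headers and did not rewrite the Subject or wrap the message in a report (both assumptions). The function name and the `prefix` parameter are hypothetical.

```python
import mailbox


def strip_sa_headers(src_path, dst_path, prefix="X-Spam-"):
    """Copy every message from src_path to dst_path without X-Spam-* headers.

    Assumption: SpamAssassin markup consists only of headers starting with
    `prefix`. Returns the number of messages copied.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    count = 0
    try:
        for msg in src:
            # Header names are matched case-insensitively.
            for name in set(msg.keys()):
                if name.lower().startswith(prefix.lower()):
                    del msg[name]  # deletes every occurrence of this header
            dst.add(msg)
            count += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return count
```

Because nothing here opens the Bayes database, the lock error from the loop above cannot occur in this step; whether the result is close enough to what `spamassassin -d` produces depends on the local SA configuration.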
Re: About reporting
>> On 19.09.09 22:45, João Eiras wrote:
>>> I still haven't got the answer to my last question, so here it goes again:
>>> Can I report a full mbox file with many mails in one go? Or should I split
>>> each mail into its own file?
>>
>> sa-learn can accept mail in mbox format. It's even in its manual page.
>> Is this what you have meant?
>
> Might be. I'm not familiar with spam assassin's internals, just a normal
> e-mail user wanting to report some spam.
>
> A quick man sa-learn shows
>
>     --mbox  sa-learn will read in the file(s) containing the emails to be
>             learned, and will process them in mbox format (one or more
>             emails per file).
>
> So, I have my answer. The page at
> http://wiki.apache.org/spamassassin/ReportingSpam is ambiguous in this
> regard. Just mentions a message.txt.
>
> I hope you can make something out of my uploaded mbox file :)
>
> Thank you

After further testing: while sa-learn might support multiple emails in a single mbox file, "spamassassin -r" does not, so I cooked up a perl script to split the mbox file into individual mails, and then a simple loop is enough to report everything.

Goodbye.
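The perl script mentioned above was not posted. As an illustration only, the same split step can be sketched with Python's standard `mailbox` module; the function name and the `msgNNNN.eml` file naming are hypothetical choices, not anything from the thread.

```python
import mailbox
from pathlib import Path


def split_mbox(mbox_path, out_dir):
    """Write each message of an mbox to its own file; return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(mbox_path, create=False)
    paths = []
    try:
        for i, key in enumerate(box.keys(), start=1):
            target = out / f"msg{i:04d}.eml"
            # get_bytes() returns the message without the mbox "From " line.
            target.write_bytes(box.get_bytes(key))
            paths.append(target)
    finally:
        box.close()
    return paths
```

Each resulting file can then be fed to a per-message command such as the `spamassassin -r` mentioned in the message, one file at a time.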
Re: Problems with high spam
On Mon, Sep 21, 2009 at 11:34 AM, Martin Gregorie wrote:
> On Mon, 2009-09-21 at 09:58 -0500, Jose Luis Marin Perez wrote:
>
>> I will implement improvements in the configuration suggested and
>> observe the results, however, what more could be suggested to improve
>> my spam service?
>
> I think you need to find out more about where your system resources are
> going.
>
> For starters, take a look at maillog (/var/log/maillog on my system) to
> check whether any SA child processes are timing out. If they are, you
> need to find out why processing those messages took so long and, if
> possible, speed that up, e.g. if RBL checks or domain name lookups are
> slow, consider running a local caching DNS.
>
> If that doesn't turn up anything obvious, use performance monitoring
> tools (sar, iostat, mpstat, etc) to see what is consuming the system
> resources: you have to know where and what the bottleneck(s) are before
> you can do anything about them. You can find these tools here:
>
> http://freshmeat.net/projects/sysstat/
>
> if they aren't part of your distro's package repository.
>
> Martin

Has there been any evidence that the OP's system is short on resources? If so I missed it.

The complaint was that too much spam is making it past the filter, with a detection rate of only 54%. This is not a very good percentage for a typical mail flow (if it is actually accurate, i.e. not missing the mails rejected by RBLs or RFC/syntax checks).

There were several issues with the configuration that kind people on the list have pointed out. Assuming these suggested changes have been implemented, what is the detection rate now?

From the posted local.cf, it is evident that the SA configuration is not working very well. There are many manually entered whitelist rules, and also many manually added rules that score 100. This is a telltale sign of a very bad setup that is attempting to bandaid instead of fixing the core issue.
And as pointed out before, both the whitelist and the subject match -> 100 are very bad ideas. Whitelisting the sender is easily taken advantage of by spammers, and those +100 pts matches are sure to generate FPs. Using rules this way demonstrates a lack of understanding of the way SA is supposed to work. SA rules rarely attempt to kill a message in one shot (100 pts); instead they add or subtract a small amount from the score based on the likelihood that a match means spam or ham. Fine tuning, not smashing with a hammer.

So, I think it is pretty safe to assume that the problem lies within the SA configuration. Maybe there are old rulesets that need to be updated. Maybe there was not a good selection of rulesets in the first place. Perhaps this is an "out of the box" configuration that has never been properly set up.

There are many good guides to setting up SA and supporting services available online. If the OP were to follow one of them to the letter, I think the detection rate would be much improved. Also, some time spent learning more about SA in general would allow the OP to fine tune his config so that the current manual effort put into creating hammer-smashing rules is unneeded.

Good luck
-Aaron
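The "add or subtract a small amount" idea can be sketched as a toy scorer. Everything in it is hypothetical: the rule names and score values are made up for illustration, and the threshold is only assumed to be 5.0 here.

```python
def score_message(hits, scores, threshold=5.0):
    """Sum the scores of the rules that hit; flag spam at or above threshold."""
    total = sum(scores[name] for name in hits)
    return total, total >= threshold


# Hypothetical rules: each contributes a small positive or negative amount.
SCORES = {
    "SUBJECT_KEYWORD": 1.5,
    "URI_ON_BLACKLIST": 2.5,
    "BAYES_HIGH": 3.0,
    "HTML_ONLY": 0.5,
    "KNOWN_GOOD_SENDER_SIG": -1.0,
}

# One weak hit alone does not condemn a message...
print(score_message(["SUBJECT_KEYWORD"], SCORES))
# ...but several independent hits together cross the threshold.
print(score_message(["SUBJECT_KEYWORD", "URI_ON_BLACKLIST", "BAYES_HIGH"], SCORES))
```

With a single rule scored at 100, any accidental match is a false positive on its own, which is the failure mode described above.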
Re: About reporting
On , Matus UHLAR - fantomas wrote:
>> On , Theo Van Dinter wrote:
>>> On Sun, Sep 13, 2009 at 5:08 PM, João Eiras wrote:
>>>> Should the file message.txt in the example contain the full e-mail with
>>>> headers, attachments and everything?
>>>
>>> Yes. It should be the original and complete message.
>>>
>>>> Does the reporting tool remove all information about the receiver for
>>>> privacy's sake?
>>>
>>> No, nothing is removed from the message.
>
> On 19.09.09 22:45, João Eiras wrote:
>> I still haven't got the answer to my last question, so here it goes again:
>> Can I report a full mbox file with many mails in one go? Or should I split
>> each mail into its own file?
>
> sa-learn can accept mail in mbox format. It's even in its manual page.
> Is this what you have meant?

Might be. I'm not familiar with spam assassin's internals, just a normal e-mail user wanting to report some spam.

A quick man sa-learn shows

    --mbox  sa-learn will read in the file(s) containing the emails to be
            learned, and will process them in mbox format (one or more
            emails per file).

So, I have my answer. The page at http://wiki.apache.org/spamassassin/ReportingSpam is ambiguous in this regard. Just mentions a message.txt.

I hope you can make something out of my uploaded mbox file :)

Thank you
Re: Re-running SA on an mbox
Hi,

> IIRC you previously mentioned using Pine. Just in case you're not aware
> the default format for Pine/Alpine is MBX, an extended version of
> MBOX. You can tell the difference because MBX mailboxes start with a
> dummy email that's hidden by the software.

It seems that if you save messages into a separate folder it does not add the DUMMY information at the top. I believe this is why the system was set up to use "mbox" and not "mbx". Does this sound correct?

> I'd be very wary about allowing any tool to modify an MBX file unless
> you know it's safe. Where locking is an issue, Mark Crispin recommends
> that they only be accessed via the c-client library.

This isn't the actual spool file, but a copy in the home directory.

Thanks,
Alex
Understanding SpamAssassin
I am trying to understand the inner workings of spam assassin, and it would be great if someone could answer my questions. I have read the online documentation, but there are still some questions left unanswered or that I am not sure about.

As far as I understand, the default configuration of spamassassin processes emails in this fashion:

DNSBL Tests ---> RAW Body Tests ---> Bayesian Learning --> AWL

[Is the sequence right? I know for sure AWL comes in last; what about the order of Bayesian learning and RAW Body tests? Did I miss any module?]

Why do we need Bayesian learning in the presence of RAW body tests?

Mails which have a very high or very low score are fed to Bayesian learning. Since we are confident about them being HAM or SPAM, what do we want to learn from them? The regex filters have identified that the mail is spam (say); what additional benefit does Bayesian learning achieve? Does it learn other words in the spam mail (say, words surrounding an obfuscated term) in the hope of matching them in future emails? Or am I understanding it completely wrong?

Thnx for help.
Re: Eliminating russian spam
On Tue, 22 Sep 2009, Makoev Alan wrote:

> I've written a brief "how-to" for blocking e-mail in Russian. It's intended
> for those who are confident that any message in Russian sent to them is
> nothing but spam. See it here: http://sa-russian.narod.ru/no_russian.html
>
> I'd like to see SA experts' opinions and advice.

> However, the message can be a MIME "multipart" one with charset
> declarations preceding the parts within the body, so this should be a
> "full message" rule:

Not true, and bad advice (at least from a performance standpoint). Take a look at the mimeheader plugin and avoid using "full" rules.

--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
 Ignorance doesn't make stuff not exist. -- Bucky Katt
---
 19 days since a sunspot last seen - EPA blames CO2 emissions
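To illustrate why a whole-message scan is not needed for this: charset declarations in a multipart message live in the headers of each MIME part, and those can be read part by part. The sketch below uses Python's standard `email` package, not SpamAssassin or its mimeheader plugin, and the function name is hypothetical.

```python
from email import message_from_string


def part_charsets(raw_message):
    """Return the set of charsets declared on any MIME part (lower-cased)."""
    msg = message_from_string(raw_message)
    charsets = set()
    for part in msg.walk():  # visits the container and every sub-part
        charset = part.get_content_charset()  # None when no charset is declared
        if charset:
            charsets.add(charset)
    return charsets


SAMPLE = """\
From: sender@example.com
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1"

--b1
Content-Type: text/plain; charset="KOI8-R"

text
--b1
Content-Type: text/html; charset="us-ascii"

<p>text</p>
--b1--
"""

print(part_charsets(SAMPLE))
```

A rule that tests only these per-part headers touches a few short lines per message, rather than running a regex over the entire raw message as a "full" rule would.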
Re: Understanding SpamAssassin
poifgh wrote:
> I am trying to understand inner workings of spam assassin and would be great
> if someone can answer my questions. I have read online documentation but
> there are still some questions left unanswered or I am not sure about.

I'm not an expert, just a long-time user, but I can give you some basic answers.

> As far as I understand, the default configuration of spamassassin processes
> emails in this fashion
>
> DNSBL Tests ---> RAW Body Tests ---> Bayesian Learning --> AWL
>
> [Is the sequence right? I know for sure AWL comes in last, what about
> Bayesian learning and RAW Body tests' order? Did I miss any module?]

As I understand it, quite a bit of this is done in parallel. In particular, the DNS based tests are fired off first and then other tests are run while waiting for the response. In any case, unless you are playing with the shortcut features, all rules are run for every message, so does it really matter what order they are in?

> Why do we need Bayesian learning in presence of RAW body tests?
>
> Mails which have very high or very low score are fed to bayesian learning.
> Since we are confident about them being HAM or SPAM what do we want to learn
> from them - The regex filters have identified that the mail is a spam (say),
> what additional does bayesian learning achieve? Does it learn other words in
> the spam mail (say words surrounding obfuscated term) in hope of matching
> them in future emails? Or am I understanding it completely different?

For auto-learning, the high and low scoring messages are fed to Bayes. However, for an optimal setup, you should manually train Bayes on as much of your (verified) ham and spam as possible. The more of your mail stream Bayes sees, the better the results will be.

Your description of Bayes is pretty close. It breaks down the message into "tokens" (words and character sequences) and then keeps track of how likely each of those tokens is to appear in either a ham or spam message. When a new message comes in, Bayes breaks it into tokens and then scores it depending on which tokens were found in the message.

--
Bowie.
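The token bookkeeping described above can be sketched in a few lines. This is a deliberately naive illustration, not SpamAssassin's Bayes implementation: the tokenizer, the add-one smoothing, and the simple averaging used to combine token values are all assumptions chosen for brevity, and the class and method names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into a set of lower-cased word-like tokens."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class TokenStats:
    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages seen in
        self.ham_tokens = Counter()   # token -> number of ham messages seen in
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def token_spamminess(self, token):
        # Add-one smoothing so unseen tokens come out neutral (0.5).
        p_spam = (self.spam_tokens[token] + 1) / (self.n_spam + 2)
        p_ham = (self.ham_tokens[token] + 1) / (self.n_ham + 2)
        return p_spam / (p_spam + p_ham)

    def score(self, text):
        # Naive combination: average the per-token values.
        tokens = tokenize(text)
        if not tokens:
            return 0.5
        return sum(self.token_spamminess(t) for t in tokens) / len(tokens)
```

This also shows what Bayes adds on top of regex rules: the ordinary words surrounding an obfuscated term get counted too, so a later message can score high on those words even when no hand-written rule matches it.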
Re: Re-running SA on an mbox
> but this will invalidate DKIM headers if the headers are signed; is
> spamassassin aware of this problem? (in general)

Are you saying there is a bug?

> mutt -f mbox
>
> in mutt, save to another folder if misclassified

Yes, I use pine for that, but would like to eliminate as many of the FNs as possible, particularly ones that I can't determine visually.

Thanks,
Dave
Re: Re-running SA on an mbox
Hi,

>> Thank you all for your help. The "mbox split" suggestion is a good
>> one. I'll follow that route and post my experience later.
>
> formail -s is the way to go.

I thought about that as a component of procmail. Sounds great.

Thanks,
Alex
Re: Re-running SA on an mbox
On Sun, 20 Sep 2009 21:15:14 -0400 MySQL Student wrote:

> Hi,
>
> I have an mbox with about a 100 messages in it from a few days ago.
> The mbox is a combination of spam and ham. What is the best way to run
> SA through these messages again, so I can catch the ones that have
> URLs in them that weren't on the blacklist at the time they were
> received?

IIRC you previously mentioned using Pine. Just in case you're not aware, the default format for Pine/Alpine is MBX, an extended version of MBOX. You can tell the difference because MBX mailboxes start with a dummy email that's hidden by the software.

I'd be very wary about allowing any tool to modify an MBX file unless you know it's safe. Where locking is an issue, Mark Crispin recommends that they only be accessed via the c-client library.
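Before letting a tool rewrite a mailbox file, a quick read-only check of what kind of file it is may help. The sketch below rests on two assumptions about c-client that are not stated in the thread and should be verified locally: that native MBX files begin with the bytes `*mbx*`, and that the hidden dummy message in c-client-managed mbox files carries the subject shown in `PSEUDO_SUBJECT`. The function name and return labels are hypothetical.

```python
import mailbox

# Assumed subject of the c-client internal-data pseudo-message.
PSEUDO_SUBJECT = "DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA"


def classify_mailbox(path):
    """Return 'mbx', 'mbox+pseudo', 'mbox' or 'unknown' without modifying the file."""
    with open(path, "rb") as fh:
        head = fh.read(5)
    if head == b"*mbx*":  # assumed MBX file signature
        return "mbx"
    if head != b"From ":  # a plain mbox starts with a "From " separator line
        return "unknown"
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:  # only the first message is inspected
            subject = (msg["Subject"] or "").strip()
            return "mbox+pseudo" if subject == PSEUDO_SUBJECT else "mbox"
        return "mbox"
    finally:
        box.close()
```

A result of `mbx` or `mbox+pseudo` would be a hint to work on a copy, or to go through the mail client, rather than rewriting the file in place.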
RE: Problems with high spam
On Mon, 2009-09-21 at 09:58 -0500, Jose Luis Marin Perez wrote:

> I will implement improvements in the configuration suggested and
> observe the results, however, what more could be suggested to improve
> my spam service?

I think you need to find out more about where your system resources are going.

For starters, take a look at maillog (/var/log/maillog on my system) to check whether any SA child processes are timing out. If they are, you need to find out why processing those messages took so long and, if possible, speed that up, e.g. if RBL checks or domain name lookups are slow, consider running a local caching DNS.

If that doesn't turn up anything obvious, use performance monitoring tools (sar, iostat, mpstat, etc) to see what is consuming the system resources: you have to know where and what the bottleneck(s) are before you can do anything about them. You can find these tools here:

http://freshmeat.net/projects/sysstat/

if they aren't part of your distro's package repository.

Martin
RE: Problems with high spam
Dear Sirs,

I appreciate your help.

So the problem would not be the low RAM?

I will implement improvements in the configuration suggested and observe the results, however, what more could be suggested to improve my spam service?

This is my current memory usage:

                     total   used   free   shared   buffers   cached
    Mem:               501    284    216        0        24       41
    -/+ buffers/cache:        218    282
    Swap:             1027     59    968

Thanks for your time and support.

Jose Luis

> Subject: Re: Problems with high spam
> From: guent...@rudersport.de
> To: users@spamassassin.apache.org
> Date: Sat, 19 Sep 2009 18:15:14 +0200
>
> On Sat, 2009-09-19 at 02:23 -0400, Aaron Wolfe wrote:
> > 2009/9/18 Karsten Bräckelmann:
> > > This machine NEEDS more RAM. In fact, I'd guess half of the spam
> > > slipping through is due to timeouts. Thrashing into hell.
> >
> > throwing ram at a server is not a solution in this case. 512MB is
> > sufficient to handle this mail load, as indicated by his post showing
> > little swap utilization on the system and confirmed by my real world
>
> You're right, Aaron, the output of 'free' suggests this is not actually
> a problem.
>
> Alas, even though I asked repeatedly, this data point was given after
> that post of mine, and I was limited to very little info and some
> observations.
>
> > experience. here we handle over 1 million messages per day per node,
> > each node has 1GB ram. ram required is easily calculated by base
> > services + SA instance usage X number of instances you'd like to use.
> > having less instances generally just means slight (very slight in most
> > cases) delays. having more instances than your ram can contain means
> > big delays. properly configured server will not start swapping and
> > falling over when a flood of mail comes in, mail simply spends more
> > time in queue. the difference between 1 second and 1 minute in queue
> > is not usually significant to users.
> >
> > the problem here is bad administration. hopefully with the advice
> > given on list and better yet some time spent studying docs, this can
> > be corrected.
>
> --
> char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
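The sizing rule quoted above (base services + SA instance usage x number of instances) can be turned around to estimate how many SA instances fit in a given amount of RAM. The sketch is only arithmetic; apart from the 501 MB total taken from the `free` output, the figures in the example call are made-up placeholders, and real per-instance usage has to be measured on the system in question.

```python
def max_sa_instances(total_mb, base_services_mb, per_instance_mb):
    """Largest number of SA instances that fits in RAM (whole MB, no swap)."""
    if per_instance_mb <= 0:
        raise ValueError("per_instance_mb must be positive")
    available = total_mb - base_services_mb
    return max(available // per_instance_mb, 0)


# 501 MB total is from the post; 150 MB base and 50 MB per instance are
# hypothetical values used only to show the calculation.
print(max_sa_instances(total_mb=501, base_services_mb=150, per_instance_mb=50))
```

Running fewer instances than this bound only lengthens queue time slightly, as the quoted message notes; running more is what pushes the machine into swap.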
Re: About reporting
> On , Theo Van Dinter wrote:
>
>> On Sun, Sep 13, 2009 at 5:08 PM, João Eiras wrote:
>>> Should the file message.txt in the example contain the full e-mail with
>>> headers, attachments and everything?
>>
>> Yes. It should be the original and complete message.
>>
>>> Does the reporting tool remove all information about the receiver for
>>> privacy's sake?
>>
>> No, nothing is removed from the message.

On 19.09.09 22:45, João Eiras wrote:
> I still haven't got the answer to my last question, so here it goes again:
> Can I report a full mbox file with many mails in one go? Or should I split
> each mail into its own file?

sa-learn can accept mail in mbox format. It's even in its manual page.
Is this what you have meant?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Eagles may soar, but weasels don't get sucked into jet engines.
Re: Problems with high spam
> > also, if using amavisd, putting its temp dir in RAM speeds up scanning
> > and is considered safe; the MTA has the mail on disk as the backup :)

On 19.09.09 00:56, MySQL Student wrote:
> How about mounting /var with noatime? Does anyone do that? Do you
> think it helps? What Linux filesystem is best suited for this? ext4?

Only for huge filesystems with many files, e.g. proxy caches, mail queues, news spools etc.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Depression is merely anger without enthusiasm.