Re: Increase in Image Spam
On 2/20/2014 10:35 PM, Amir Caspi wrote:
> On Feb 20, 2014, at 8:07 PM, Kevin A. McGrail wrote:
>> No need to run through 3.3.2. The emails are well over the 256KB limit hard coded in sa-learn with 3.3.2.
>
> Understood, and thanks for checking on this. Now that I know this is the problem, I've manually edited Mail::SpamAssassin::ArchiveIterator.pm to change the BIG_BYTES limit from 256K to 1500K (which I've found is a reasonable size for my small system). I've verified that this change allows sa-learn to work properly for these messages.
>
> Is there any reason that such a manual edit could cause problems elsewhere, or am I safe to have made this change? (Neglect the fact that large messages could cause high loads, my system can handle that.) Or, would you recommend that instead of making this change, I just set opt_all => 1 in sa-learn's instantiation of ArchiveIterator? (That is, modify sa-learn instead of ArchiveIterator.)

I don't know, sorry. Let us know if you find any issues for sure.

> Now, that brings up the other question: I have other mails that are well below the 256K limit (and certainly below the 1500K limit I just made), but they are still not being examined by sa-learn. These messages are pretty old (from July 2013) ... are they being ignored because they are too old? I don't see that sa-learn is using opt_before or opt_after for Archive_Iterator, and I don't see anywhere else where it's excluding old messages... and there are no errors in the debug output, but I'm still getting "0 message examined."
>
> This sample mbox of old mails is here: https://www.dropbox.com/s/zvbmvk8pb06v0m8/SA_testspam_old.mbox
>
> If it's being ignored based on date, how would I know that? Sorry for being dense. =)

The file isn't in mbox format. No From separators.

Regards,
KAM
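The "No From separators" diagnosis can be checked before running sa-learn at all. The following is only a rough sketch, not part of SpamAssassin: the function names are made up for illustration, and the test is deliberately loose (any line starting with "From " followed by a non-space character counts as a separator).

```python
import re

# An mbox separator line starts with "From " (no colon), unlike the
# "From:" header inside a message.
FROM_LINE = re.compile(rb"^From \S+", re.MULTILINE)


def count_mbox_separators(data: bytes) -> int:
    """Count lines that look like mbox 'From ' separator lines."""
    return len(FROM_LINE.findall(data))


def looks_like_mbox(data: bytes) -> bool:
    """An mbox file is expected to begin with a 'From ' separator line."""
    return data.startswith(b"From ") and count_mbox_separators(data) > 0
```

Reading a file in binary mode and passing its contents to `looks_like_mbox` gives a quick yes/no answer. A file that starts directly with headers such as `Return-Path:` or `From:` returns False, which matches the situation described in the reply above.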
Re: Increase in Image Spam
On Feb 20, 2014, at 8:07 PM, Kevin A. McGrail wrote: > No need to run through 3.3.2. The emails are well over the 256KB limit hard > coded in sa-learn with 3.3.2. Understood, and thanks for checking on this. Now that I know this is the problem, I've manually edited Mail::SpamAssassin::ArchiveIterator.pm to change the BIG_BYTES limit from 256K to 1500K (which I've found is a reasonable size for my small system). I've verified that this change allows sa-learn to work properly for these messages. Is there any reason that such a manual edit could cause problems elsewhere, or am I safe to have made this change? (Neglect the fact that large messages could cause high loads, my system can handle that.) Or, would you recommend that instead of making this change, I just set opt_all => 1 in sa-learn's instantiation of ArchiveIterator? (That is, modify sa-learn instead of ArchiveIterator.) Now, that brings up the other question: I have other mails that are well below the 256K limit (and certainly below the 1500K limit I just made), but they are still not being examined by sa-learn. These messages are pretty old (from July 2013) ... are they being ignored because they are too old? I don't see that sa-learn is using opt_before or opt_after for Archive_Iterator, and I don't see anywhere else where it's excluding old messages... and there are no errors in the debug output, but I'm still getting "0 message examined." This sample mbox of old mails is here: https://www.dropbox.com/s/zvbmvk8pb06v0m8/SA_testspam_old.mbox If it's being ignored based on date, how would I know that? Sorry for being dense. =) Thanks. --- Amir
Re: Increase in Image Spam
On 2/20/2014 7:18 PM, Amir 'CG' Caspi wrote:
> If you have a chance, please run it through both 3.3.2 and 3.4.0, to see if there's a difference... clearly, it's not working on _MY_ 3.3.2 for some reason! I sent the exact commands that I used in a prior email a couple of hours ago.
>
> Thanks. =)
>
> --- Amir

No need to run through 3.3.2. The emails are well over the 256KB limit hard coded in sa-learn with 3.3.2.

3.4.0:

sa-learn -D --mbox --progress --spam < /tmp/temp.mbox 2>&1 | tee /tmp/output

Feb 20 21:51:33.484 [21525] dbg: archive-iterator: _run_mailbox /tmp/.spamassassin2152599LqEKtmp, ofs 0, limit 262144
Feb 20 21:51:33.500 [21525] info: archive-iterator: skipping large message: 4089 lines, 262160 bytes, limit 262144 bytes
Feb 20 21:51:33.501 [21525] dbg: archive-iterator: _run_mailbox /tmp/.spamassassin2152599LqEKtmp, ofs 429849, limit 262144
Feb 20 21:51:33.517 [21525] info: archive-iterator: skipping large message: 4088 lines, 262169 bytes, limit 262144 bytes

Re-running with a limit high enough to

sa-learn -D --mbox --progress --spam < /tmp/temp.mbox --max-size=60 2>&1 | tee /tmp/output

Learned tokens from 2 message(s) (2 message(s) examined)

Output from debug and everything ;-)

regards,
KAM
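When the debug output is long, the "skipping large message" lines can be pulled out mechanically. This is a small sketch that assumes the log wording is exactly as quoted in the message above; `parse_skips` is a hypothetical helper name, not something shipped with SpamAssassin.

```python
import re

# Matches the info line quoted in the 3.4.0 debug output above.
SKIP = re.compile(
    r"skipping large message: (\d+) lines, (\d+) bytes, limit (\d+) bytes"
)


def parse_skips(log_text):
    """Return (lines, size, limit, overshoot) for each skipped message."""
    results = []
    for match in SKIP.finditer(log_text):
        lines, size, limit = (int(group) for group in match.groups())
        results.append((lines, size, limit, size - limit))
    return results


LOG = """\
Feb 20 21:51:33.500 [21525] info: archive-iterator: skipping large message: 4089 lines, 262160 bytes, limit 262144 bytes
Feb 20 21:51:33.517 [21525] info: archive-iterator: skipping large message: 4088 lines, 262169 bytes, limit 262144 bytes
"""

skips = parse_skips(LOG)
```

The limit of 262144 bytes in the log is 256 * 1024, i.e. the 256K default mentioned throughout the thread.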
Re: Increase in Image Spam
On Thu, 20 Feb 2014, Ian Zimmerman wrote:
> On Thu, 20 Feb 2014 11:57:17 -0800 (PST) John Hardin wrote:
> Amir> When I run sa-learn on this mailbox, it says:
> Amir> Learned tokens from 0 message(s) (0 message(s) examined)
> John> "0 messages examined" generally means either the format isn't what
> John> sa-learn expected, or the message is larger than the size limit.
>
> In my case it usually means the message has been learned already and SA just refuses to do so for the 2nd time :-)

That would be "learned tokens from 0 messages (n > 0 messages examined)".

--
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
You do not examine legislation in the light of the benefits it will convey if properly administered, but in the light of the wrongs it would do and the harms it would cause if improperly administered. -- Lyndon B. Johnson
---
2 days until George Washington's 282nd Birthday
Re: Increase in Image Spam
On Feb 20, 2014, at 7:07 PM, Ian Zimmerman wrote: > In my case it usually means the message has been learned already and SA > just refuses to do so for the 2nd time :-) When I run sa-learn on already-learned messages, it says 0 tokens learned, but it still says N messages examined (where N > 0). That is, it _examines_ the messages, but does not learn from them, because they were already processed. In this case, it's not even examining the messages, which is a different problem. Thanks. --- Amir
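The distinction drawn above (examined but not learned, versus not examined at all) can be read straight off sa-learn's summary line. A minimal sketch, assuming the summary wording quoted in this thread; the `diagnose` function and its return strings are illustrative only.

```python
import re

# Summary line as quoted in the thread:
#   Learned tokens from 0 message(s) (0 message(s) examined)
SUMMARY = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)


def diagnose(summary_line):
    """Classify an sa-learn summary line into the cases discussed above."""
    match = SUMMARY.search(summary_line)
    if match is None:
        return "unrecognized output"
    learned, examined = int(match.group(1)), int(match.group(2))
    if examined == 0:
        # Per the thread: wrong input format, or message over the size limit.
        return "nothing examined: check input format and size limit"
    if learned == 0:
        # Per the thread: messages were seen but had been learned before.
        return "examined but not learned: probably already learned"
    return "learned %d of %d" % (learned, examined)
```

For example, `diagnose("Learned tokens from 0 message(s) (0 message(s) examined)")` lands in the first case, which is the one this thread is about.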
Re: Increase in Image Spam
On Thu, 20 Feb 2014 11:57:17 -0800 (PST) John Hardin wrote: Amir> When I run sa-learn on this mailbox, it says: Amir> Learned tokens from 0 message(s) (0 message(s) examined) John> "0 messages examined" generally means either the format isn't what John> sa-learn expected, or the message is larger than the size limit. In my case it usually means the message has been learned already and SA just refuses to do so for the 2nd time :-) -- Please *no* private copies of mailing list or newsgroup messages. gpg public key: 2048R/984A8AE4 fingerprint: 7953 ADA1 0E8E AB57 FB79 FFD2 360A 88B2 984A 8AE4 Funny pic: http://bit.ly/ZNE2MX
Re: Increase in Image Spam
On Thu, February 20, 2014 5:13 pm, Kevin A. McGrail wrote: > Resend the mbox.link and I will likely have a cycle to throw it through. https://www.dropbox.com/s/m4fuv670wnvwa16/SA_testspam.mbox To be deleted in 24-48 hours (don't want spammers harvesting it). If you have a chance, please run it through both 3.3.2 and 3.4.0, to see if there's a difference... clearly, it's not working on _MY_ 3.3.2 for some reason! I sent the exact commands that I used in a prior email a couple of hours ago. Thanks. =) --- Amir
Re: Increase in Image Spam
Resend the mbox link and I will likely have a cycle to throw it through.

Regards,
KAM

Amir 'CG' Caspi wrote:
> On Thu, February 20, 2014 4:08 pm, Kevin A. McGrail wrote:
>> Probably best if you install 3.4.0 (or even trunk) on a test system and
>> throw the offending email onto that server and run sa-learn on that box
>> with -D.
>
> In the meantime, anyone want to do it on my behalf? =) I provided the
> mbox link earlier; I unfortunately do not have a test system available.
>
> (I'm not quite a professional sysadmin...)
>
> --- Amir
Re: Increase in Image Spam
On Thu, February 20, 2014 4:08 pm, Kevin A. McGrail wrote: > Probably best if you install 3.4.0 (or even trunk) on a test system and > throw the offending email onto that server and run sa-learn on that box > with -D. In the meantime, anyone want to do it on my behalf? =) I provided the mbox link earlier; I unfortunately do not have a test system available. (I'm not quite a professional sysadmin...) --- Amir
Re: Increase in Image Spam
On 2/20/2014 6:01 PM, Amir 'CG' Caspi wrote: On Thu, February 20, 2014 3:52 pm, Kevin A. McGrail wrote: Questions that will be answered by "that is solved in 3.4.0" aren't really going to get much support from me... Understood, though it'll be a while before I can upgrade to 3.4 due to the RPM issue that I've mentioned previously. However, I Googled this issue before mailing and this iterator error you posted SHOULD appear in sa-learn even in 3.3.x, but it does not seem to. More to the point, when trying to run on a spam that had previously worked fine with v3.3.1, sa-learn STILL says "0 messages examined" and that spam is only 4K, so there's no chance it's running up against the max-size limit. (On the other hand, that spam is many months old -- does sa-learn have a date limit as well? If so, is that customizable?) Probably best if you install 3.4.0 (or even trunk) on a test system and throw the offending email onto that server and run sa-learn on that box with -D. Then we can start discussing apples to apples and add more debugging if needed. regards, KAM
Re: Increase in Image Spam
On Thu, February 20, 2014 3:52 pm, Kevin A. McGrail wrote: > Questions that will be answered by "that is solved in 3.4.0" aren't > really going to get much support from me... Understood, though it'll be a while before I can upgrade to 3.4 due to the RPM issue that I've mentioned previously. However, I Googled this issue before mailing and this iterator error you posted SHOULD appear in sa-learn even in 3.3.x, but it does not seem to. More to the point, when trying to run on a spam that had previously worked fine with v3.3.1, sa-learn STILL says "0 messages examined" and that spam is only 4K, so there's no chance it's running up against the max-size limit. (On the other hand, that spam is many months old -- does sa-learn have a date limit as well? If so, is that customizable?) --- Amir
Re: Increase in Image Spam
On 2/20/2014 5:48 PM, Martin Gregorie wrote: On Thu, 2014-02-20 at 17:29 -0500, Kevin A. McGrail wrote: More to the point, spamc would have to process all config files first which would slow it down. The point of spamc is to be a VERY lightweight connection to spamd. That's why I suggested that spamc could be handed that value by spamd before it ships the message over. I had the same suggestion. "If you really want this, I'd say off the cuff you should implement a new version of the spamc protocol and have the spamc/spamd negotiate whether the connection was going to be accepted by sending the message size ahead of time coupled with a local.cf option for the spamd max message size. You can open a feature request for this at bugzilla and I'd be happy to help testing any patches you might come up with." So in short, if you like the idea, take a whack at the code and make a patch. regards, KAM
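The idea floated above (announce the message size first, let the server decide, and only then ship the body) can be illustrated with a toy model. To be clear, this is not the spamc/spamd protocol and none of these names exist in SpamAssassin: `ServerConfig`, `max_message_size`, `server_accepts` and `client_submit` are all invented here to show the shape of such a negotiation.

```python
from dataclasses import dataclass


@dataclass
class ServerConfig:
    # Hypothetical central setting, in bytes; 0 means "no limit".
    max_message_size: int


def server_accepts(cfg: ServerConfig, announced_size: int) -> bool:
    """Server-side decision based only on the announced size."""
    return cfg.max_message_size == 0 or announced_size <= cfg.max_message_size


def client_submit(cfg: ServerConfig, message: bytes):
    """Announce the size first; only 'send' the body if the server agrees.

    Returns (status, bytes_sent) so the saving is visible: a refused
    message costs one small exchange instead of a full transfer.
    """
    if not server_accepts(cfg, len(message)):
        return "skipped", 0
    return "sent", len(message)
```

With `ServerConfig(max_message_size=256 * 1024)`, a 300 KiB message is refused after the size announcement and no body bytes are transferred, which is the point of negotiating up front rather than sending a large message only to have it rejected.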
Re: Increase in Image Spam
On 2/20/2014 5:38 PM, Amir 'CG' Caspi wrote:
> On Thu, February 20, 2014 3:29 pm, Kevin A. McGrail wrote:
>> Unifying wouldn't be something I would want to see.
>
> Well, no one is arguing to _force_ unification, but to provide an option for it. That is, max-size could be set in local.cf and would become a global parameter, but could still be overridden with CLI options.

I think based on your idea below, the CLI could not override it. For example, if I have a max size on spamd, how could spamc override it? Right now, spamd has no limit and spamc enforces a limit.

As with before, just because *I* don't want to see it just means that you have to figure it out on your own and come up with a patch that doesn't break existing functionality but adds what you want. And I'm willing to discuss the concept and test patches because I see some merit. But I know I have other issues I want to focus on right now with SA.

>> Typically if you were using spamassassin, a size limit would be implemented by your .procmailrc implementation for example.
>
> Well, at least in 3.3.2, there is no apparent max-size parameter for spamassassin (the direct SA executable, not spamc/spamd or sa-learn). Older messages from the archives of this very mailing list seem to suggest that spamassassin itself has no message size limit.

That is correct. There is no size limit in that scenario of using the spamassassin executable directly. However, in the real world, there is almost 0 necessity to use that in a live environment because the startup time is too high. As noted, if you were using spamassassin, you would likely be using something like .procmailrc with a rule that limits the size, à la:

:0fw
* < 1572864

> spamc certainly does, which as you say is overridden with the -s parameter. sa-learn apparently has a hardcoded limit, although as I mentioned in my previous email, I'm not seeing any error in the debug output that it's skipping due to size.

Please try with 3.4.0 and if there is still no output in debug, let me know and I'll add something. But from looking at the code, I believe this is addressed:

info("archive-iterator: skipping large message: ". "file size %d, limit %d bytes", -s _, $opt_max_size);

Questions that will be answered by "that is solved in 3.4.0" aren't really going to get much support from me...

>> More to the point, spamc would have to process all config files first which would slow it down. The point of spamc is to be a VERY lightweight connection to spamd.
>
> Actually, if a limit is imposed centrally in spamd, I think this could be accomplished without any changes to spamc except to remove spamc's default size limit. spamc would remain lightweight, simply piping email to spamd... if the message exceeds spamd's size limit, spamd would simply regurgitate the X-Spam-Status: No header, which is exactly what spamc currently does locally when the message size limit is exceeded -- the difference is only that spamc would send the message to spamd and spamd would barf, rather than spamc barfing locally. Only spamd would have to read its central config. (A local size limit COULD still be imposed for spamc via CLI, the difference is that no local size limit would exist by default, it would have to be done via CLI.)

More to the point though, a local size limit SHOULD be imposed. Do you really want spamc sending giant messages to spamd just to have it say, no, that's too large?

If you really want this, I'd say off the cuff you should implement a new version of the spamc protocol and have the spamc/spamd negotiate whether the connection was going to be accepted by sending the message size ahead of time coupled with a local.cf option for the spamd max message size. You can open a feature request for this at bugzilla and I'd be happy to help testing any patches you might come up with.

However, in my case, I use spamc and multiple factors to determine what the max size is to send to spamd. For example, if our load average is very low, I will send very large messages to spamd. I enjoy the flexibility of the setting.

regards,
KAM
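The load-dependent limit described above could look roughly like the following. This is only a guess at the general shape: the thresholds and tier sizes are invented for illustration and are not KAM's actual policy. The only facts taken from the thread are that spamc accepts a maximum size via `-s`, that the default discussed is 256K, and that the procmail recipe uses 1572864 bytes (1536 * 1024).

```python
KIB = 1024


def pick_max_size(load1: float, cpus: int = 1) -> int:
    """Pick a spamc size limit in bytes from the 1-minute load average.

    The thresholds below are hypothetical; tune them for a real system.
    """
    per_cpu = load1 / max(cpus, 1)
    if per_cpu < 0.5:
        return 1536 * KIB  # 1572864 bytes, the value in the procmail recipe
    if per_cpu < 2.0:
        return 512 * KIB
    return 256 * KIB  # the default limit discussed in this thread


def spamc_command(max_size: int) -> list:
    """Build a spamc argument list using its -s (max size) option."""
    return ["spamc", "-s", str(max_size)]
```

On a Unix system the inputs could come from `os.getloadavg()[0]` and `os.cpu_count()`, and the resulting list could be handed to `subprocess.run` with the message on standard input.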
Re: Increase in Image Spam
On 2014-02-20 23:16, Kevin A. McGrail wrote: Are you using 3.4.0? I believe the size was hard-coded until then when the max-size option was added to sa-learn. SpamAssassin 3.4.0 (2014-02-07). Yes, I make my own ebuilds for Gentoo; 3.4 is not in Gentoo yet. Kevin: do I need to reply privately here?
Re: Increase in Image Spam
On Thu, 2014-02-20 at 17:29 -0500, Kevin A. McGrail wrote: > More to the point, spamc would have to process all config files first > which would slow it down. The point of spamc is to be a VERY > lightweight connection to spamd. > That's why I suggested that spamc could be handed that value by spamd before it ships the message over. This is or should be lightweight: in the past I was able to get 25,000 request/responses per second from a process that was answering queries against a large (500k entry) in-memory red/black btree. This was on a single core 625 MHz AlphaServer with both processes on the same box. IOW the cost per message pair was comfortably under 40 µs (1 s / 25,000) once the time needed to search the btree is subtracted. Most present-day servers should do considerably better. Martin
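The per-pair cost follows directly from the quoted throughput: 25,000 request/response pairs per second leaves 40 microseconds per pair, not 40 milliseconds. The arithmetic, spelled out (variable names are just for illustration):

```python
# Throughput quoted in the message above.
requests_per_second = 25_000

# One second is 1,000,000 microseconds, so the budget per pair is:
microseconds_per_pair = 1_000_000 / requests_per_second

# For comparison, a 40 ms cost per pair would allow only this many per second:
pairs_per_second_at_40_ms = 1_000 / 40
```

Here `microseconds_per_pair` is 40.0, while a true 40 ms per pair would cap throughput at 25 pairs per second.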
Re: Increase in Image Spam
On Thu, February 20, 2014 3:29 pm, Kevin A. McGrail wrote: > Unifying wouldn't be something I would want to see. Well, no one is arguing to _force_ unification, but to provide an option for it. That is, max-size could be set in local.cf and would become a global parameter, but could still be overridden with CLI options. > Typically if you were using spamassassin, a size limit it would be > implemented by your .procmailrc implementation for example. Well, at least in 3.3.2, there is no apparent max-size parameter for spamassassin (the direct SA executable, not spamc/spamd or sa-learn). Older messages from the archives of this very mailing list seems to suggest that spamassassin itself has no message size limit. spamc certainly does, which as you say is overridden with the -s parameter. sa-learn apparently has a hardcoded limit, although as I mentioned in my previous email, I'm not seeing any error in the debug output that it's skipping due to size. > More to the point, spamc would have to process all config files first > which would slow it down. The point of spamc is to be a VERY > lightweight connection to spamd. Actually, if a limit is imposed centrally in spamd, I think this could be accomplished without any changes to spamc except to remove spamc's default size limit. spamc would remain lightweight, simply piping email to spamd... if the message exceeds spamd's size limit, spamd would simply regurgitate the X-Spam-Status: No header, which is exactly what spamc currently does locally when the message size limit is exceeded -- the difference is only that spamc would send the message to spamd and spamd would barf, rather than spamc barfing locally. Only spamd would have to read its central config. (A local size limit COULD still be imposed for spamc via CLI, the difference is that no local size limit would exist by default, it would have to be done via CLI.) Cheers. --- Amir
Re: Increase in Image Spam
I think you were just on the email chain on list so my reply to another person went to you. On 2/20/2014 5:21 PM, Benny Pedersen wrote: On 2014-02-20 23:16, Kevin A. McGrail wrote: Are you using 3.4.0? I believe the size was hard-coded until then when the max-size option was added to sa-learn. SpamAssassin 3.4.0 (2014-02-07) yes i do ebuilds for gentoo self 3.4 is not in gentoo yet Kevin: do i need to be reply private here ? -- Kevin A. McGrail President Peregrine Computer Consultants Corporation 3927 Old Lee Highway, Suite 102-C Fairfax, VA 22030-2422 http://www.pccc.com/ 703-359-9700 x50 / 800-823-8402 (Toll-Free) 703-359-8451 (fax) kmcgr...@pccc.com
Re: Increase in Image Spam
On Thu, February 20, 2014 3:16 pm, Kevin A. McGrail wrote: > Are you using 3.4.0? I believe the size was hard-coded until then when > the max-size option was added to sa-learn. No, as mentioned previously in this flurry of emails, I'm using 3.3.2. However, note that using spamassassin directly (not learning, just classifying) works just fine, there is no complaint of max message size. Using spamc with --max-size, no complaints either. And, finally, sa-learn with -D (debug) does not show me any error messages or warnings related to message size, or ANYTHING in fact that would lead me to understand why it's skipping these messages. If they exceed the maximum size, sa-learn is being very quiet about it and not throwing an explicit error in the debug output. I echo Martin's question of whether it's possible to override the max size in local.cf, because on my system (with virtual hosts that call spamc) that would be much more preferable than having to specify max-size in every virtual host's /etc/procmailrc (which is how I have to do it now). Thanks. --- Amir
Re: Increase in Image Spam
On 2/20/2014 5:16 PM, Martin Gregorie wrote:
> On Thu, 2014-02-20 at 16:39 -0500, Kevin A. McGrail wrote:
>> On 2/20/2014 4:35 PM, Amir 'CG' Caspi wrote:
>>> If it's a size issue, how can I increase the size limit for sa-learn? But, I don't think it's a size issue since these messages are under 512k each.
>> --max-size= I believe. Default is 256K.
>
> Sorry, no. According to my manpage (SA 3.3.2) there is no --max-size option and (second try) sa-learn --max-size is rejected as an unknown option.

Try 3.4.0

--max-size    Skip messages larger than b bytes; defaults to 256 KiB, 0 implies no limit

I'll fix KiB to read KB.

> On the same subject, is there any chance that a max-size configuration parameter could be supplied via local.cf?

Don't believe so.

> 1) IMO a single central setting is better than remembering to specify and change it in several scripts. Currently it needs to be set to the same value in every script or MTA configuration that can run spamc and/or sa-learn and it's quite easy to miss one.

My systems run with different limits in different places and in fact on different servers with spamc connecting to spamd boxes on other systems. Unifying wouldn't be something I would want to see.

> 2) There currently seems to be no way of overriding the default max message size for the commands spamassassin, spamd or sa-learn.

I believe this is false. Typically if you were using spamassassin, a size limit would be implemented by your .procmailrc implementation for example. Spamd would be limited by the spamc -s parameter. sa-learn has the --max-size option added with 3.4.0.

> 3) It improves system documentation to have all parameter settings in one place.

SA is an API as well as a collection of programs implementing the API. It's a Swiss army tool with a whole bunch of configurable settings. And, as in my case, many of the tools can run on different servers by different users, etc. One place for parameters is very hard. But if you want to discuss further and can provide patches that don't break existing functionality, I'm always looking to get more people involved and submitting patches.

> I accept that setting the message size in local.cf may slow spamc down slightly if spamd doesn't already send a reply to spamc, which could pass the setting back, before accepting the message but the overhead of adding the reply message should be quite small.

More to the point, spamc would have to process all config files first which would slow it down. The point of spamc is to be a VERY lightweight connection to spamd.

regards,
KAM
Re: Increase in Image Spam
On Thu, 2014-02-20 at 16:39 -0500, Kevin A. McGrail wrote: > On 2/20/2014 4:35 PM, Amir 'CG' Caspi wrote: > > If it's a size issue, how can I increase the size limit for sa-learn? > > But, I don't think it's a size issue since these messages are under 512k > > each. > --max-size= I believe. Default is 256K. > Sorry, no. According to my manpage (SA 3.3.2) there is no --max-size option and (second try) sa-learn --max-size is rejected as an unknown option. On the same subject, is there any chance that a max-size configuration parameter could be supplied via local.cf? Reasons: 1) IMO a single central setting is better than remembering to specify and change it in several scripts. Currently it needs to be set to the same value in every script or MTA configuration that can run spamc and/or sa-learn and it's quite easy to miss one. 2) There currently seems to be no way of overriding the default max message size for the commands spamassassin, spamd or sa-learn. 3) It improves system documentation to have all parameter settings in one place. I accept that setting the message size in local.cf may slow spamc down slightly if spamd doesn't already send a reply to spamc, which could pass the setting back, before accepting the message but the overhead of adding the reply message should be quite small. Martin
Re: Increase in Image Spam
On 2/20/2014 5:07 PM, Amir 'CG' Caspi wrote: On Thu, February 20, 2014 2:49 pm, Benny Pedersen wrote: On 2014-02-20 22:39, Kevin A. McGrail wrote: --max-size= I believe. Default is 256K. sa-learn barfs, that flag is not accepted. That flag works for spamc, but not for sa-learn. sa-learn man page and CLI help don't have any mention of a max message size. Are you using 3.4.0? I believe the size was hard-coded until then when the max-size option was added to sa-learn.
Re: Increase in Image Spam
On Thu, February 20, 2014 2:49 pm, Benny Pedersen wrote:
> On 2014-02-20 22:39, Kevin A. McGrail wrote:
>> --max-size= I believe. Default is 256K.

sa-learn barfs, that flag is not accepted. That flag works for spamc, but not for sa-learn. sa-learn man page and CLI help don't have any mention of a max message size.

> and small mbox files exists, it could just be missing --mbox on
> commandline else it would use maildir as default

Here is the exact command I am running, and the exact output:

-bash-3.2$ file SA_testspam.mbox
testspam: ASCII mail text
-bash-3.2$ sa-learn --mbox --progress --spam SA_testspam.mbox
Learned tokens from 0 message(s) (0 message(s) examined)

As you can see, it is an MBOX file, and I'm passing the --mbox flag, it just doesn't like these two messages. (To reiterate, adding a few other spams results in THOSE spams getting considered, but these two messages still being ignored.)

Very strange.

--- Amir
Re: Increase in Image Spam
On 2014-02-20 22:56, Amir 'CG' Caspi wrote: I run a virtual-hosting server where the individual site RPMs are copied from server-level RPMs. Basically all software has to be installed as RPMs in order to propagate to the individual virtual hosts. Google dist2rpm: you basically just use the source from CPAN to make RPMs, and once the RPMs are built you update as you always do on CentOS. I still don't understand why CentOS people don't do this more natively themselves. Create the spec file and rebuild from a source RPM if CPAN is not an option.
Re: Increase in Image Spam
On Thu, February 20, 2014 2:39 pm, Axb wrote: > what's wrong with installing from source? I run a virtual-hosting server where the individual site RPMs are copied from server-level RPMs. Basically all software has to be installed as RPMs in order to propagate to the individual virtual hosts. --- Amir
Re: Increase in Image Spam
On 2014-02-20 22:39, Kevin A. McGrail wrote: On 2/20/2014 4:35 PM, Amir 'CG' Caspi wrote: If it's a size issue, how can I increase the size limit for sa-learn? But, I don't think it's a size issue since these messages are under 512k each. --max-size= I believe. Default is 256K. and small mbox files exists, it could just be missing --mbox on commandline else it would use maildir as default
Re: Increase in Image Spam
On 2014-02-20 22:39, Axb wrote: noticed? (I can't install 3.4 since it hasn't been RPM'd for CentOS 5.x yet.) what's wrong with installing from source? (NOT Cpan install) http://searchcode.com/codesearch/view/21483839 The hardest part is knowing how to :=)
Re: Increase in Image Spam
On 2/20/2014 4:39 PM, Axb wrote: On 02/20/2014 10:35 PM, Amir 'CG' Caspi wrote: Note that I have some other spams for which this is now an issue but which I think worked fine in the past (with SA 3.3.1 for sure); is it possible something got borked in sa-learn between 3.3.1 and 3.3.2 and nobody noticed? (I can't install 3.4 since it hasn't been RPM'd for CentOS 5.x yet.) what's wrong with installing from source? (NOT Cpan install) Theoretically CPAN install should work now as well though FreeBSD users will need to wait for the 3.4.1 release to install cleanly due to a variable collision (script). Regards, KAM
Re: Increase in Image Spam
On 2/20/2014 4:35 PM, Amir 'CG' Caspi wrote: If it's a size issue, how can I increase the size limit for sa-learn? But, I don't think it's a size issue since these messages are under 512k each. --max-size= I believe. Default is 256K.
Re: Increase in Image Spam
On 02/20/2014 10:35 PM, Amir 'CG' Caspi wrote: Note that I have some other spams for which this is now an issue but which I think worked fine in the past (with SA 3.3.1 for sure); is it possible something got borked in sa-learn between 3.3.1 and 3.3.2 and nobody noticed? (I can't install 3.4 since it hasn't been RPM'd for CentOS 5.x yet.) what's wrong with installing from source? (NOT Cpan install)
Re: Increase in Image Spam
On Thu, February 20, 2014 12:57 pm, John Hardin wrote: > "0 messages examined" generally means either the format isn't what > sa-learn expected, or the message is larger than the size limit. The file format is most certainly MBOX... it was created by my MUA, and running "file" on it tells me that it is "ASCII mail text." As I mentioned, adding other spams to it results in those other spams being properly learned, so it can't be a format issue unless the specific messages themselves are not formatted in a way that sa-learn likes (though the MTA and MUA like it just fine). If it's a size issue, how can I increase the size limit for sa-learn? But, I don't think it's a size issue since these messages are under 512k each. Note that I have some other spams for which this is now an issue but which I think worked fine in the past (with SA 3.3.1 for sure); is it possible something got borked in sa-learn between 3.3.1 and 3.3.2 and nobody noticed? (I can't install 3.4 since it hasn't been RPM'd for CentOS 5.x yet.) I tried running sa-learn -D but the debug output didn't tell me anything (that I could see) about why it was skipping the messages. Running spamassassin on the messages works just fine (I see SA output, so it's matching rules), as does running spamc/spamd. It is only sa-learn that seems to be choking, and I have no idea why. Any additional suggestions on how I can diagnose this? Is it looking like something I can fix, or a bug in sa-learn? Thanks. --- Amir
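One way to test the size hypothesis without reading sa-learn's source is to measure each message in the mbox against the 256K default mentioned elsewhere in this thread (messages "under 512k" can still be over it). The sketch below uses Python's standard `mailbox` module; `oversized_messages` is a made-up helper, and the sizes it reports exclude the "From " separator line, so they may differ slightly from what sa-learn itself counts.

```python
import mailbox

# 256 KiB, the sa-learn default limit discussed in this thread.
DEFAULT_LIMIT = 256 * 1024


def oversized_messages(mbox_path, limit=DEFAULT_LIMIT):
    """Yield (index, size_in_bytes) for each message larger than limit."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, key in enumerate(box.iterkeys()):
            size = len(box.get_bytes(key))
            if size > limit:
                yield index, size
    finally:
        box.close()
```

Calling `list(oversized_messages("SA_testspam.mbox"))` would list the zero-based positions and approximate sizes of any messages above the limit. An empty list on a file that sa-learn still refuses to examine points away from size and towards a format problem.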
Re: Increase in Image Spam
On 2014-02-20 21:43, Axb wrote: Redis DB in RAM - do the math :) I got 781250 as the result. Now it's time to see how much power that many Pis would use :=) Has anyone thought about running MySQL in memory, if it's slow? Set engine=memory in the spamd init script, and engine=myisam on shutdown. Yes, I know it's risky, but it would be nice to see comparisons.
Re: Increase in Image Spam
On 02/20/2014 07:46 PM, Benny Pedersen wrote:
> On 2014-02-20 19:34, Axb wrote:
>> well, not huge...let me brag :)
>>
>> sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0   17663091          0  non-token data: nspam
>> 0.000          0    6768342          0  non-token data: nham
>
> how many raspberry-pi is needed in cluster setup to handle this ? :=)

# Memory
used_memory:4072212184
used_memory_human:3.79G
used_memory_rss:4163964928
used_memory_peak:4076821712
used_memory_peak_human:3.80G

Redis DB in RAM - do the math :)
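For anyone who wants to "do the math" on these figures, the counts can be pulled out of the `sa-learn --dump magic` lines and the Redis byte count converted to GiB. This sketch assumes the column layout shown in the quoted output, with the value in the third numeric column, and it assumes the fused `06768342` in the archived text stands for `0` and `6768342`. `parse_dump_magic` is a hypothetical helper name.

```python
def parse_dump_magic(text):
    """Map each 'non-token data' label to the value in the third column."""
    counts = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        parts = fields.split()
        if len(parts) >= 3:
            counts[label.strip()] = int(parts[2])
    return counts


DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0   17663091          0  non-token data: nspam
0.000          0    6768342          0  non-token data: nham
"""

counts = parse_dump_magic(DUMP)
total_learned = counts["nspam"] + counts["nham"]

# used_memory from the Redis INFO output quoted above, converted to GiB.
used_memory_gib = 4072212184 / 2**30
```

That gives 24,431,433 learned messages in total and about 3.79 GiB of Redis memory, matching the `used_memory_human` line.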
Re: Increase in Image Spam
On Thu, 20 Feb 2014, Amir Caspi wrote:
> When I run sa-learn on this mailbox, it says:
> Learned tokens from 0 message(s) (0 message(s) examined)

"0 messages examined" generally means either the format isn't what sa-learn expected, or the message is larger than the size limit.

--
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
From the Liberty perspective, it doesn't matter if it's a jackboot or a Birkenstock smashing your face. -- Robb Allen
---
2 days until George Washington's 282nd Birthday
Re: Increase in Image Spam
On Feb 20, 2014, at 11:21 AM, Kris Deugau wrote: > Have you tried learning one specific FN, then reprocessing that message > to see what Bayes score it gets? IME it will usually shift from > BAYES_00 to at least BAYES_40 in most cases, even with a large sitewide > DB with far more tokens than the usual per-user DB. Well, I just tried this, and sa-learn seems to be refusing to learn the messages. I've placed an example MBOX here, temporarily (I will delete this within the next 24-48 hours for security): https://www.dropbox.com/s/m4fuv670wnvwa16/SA_testspam.mbox When I run sa-learn on this mailbox, it says: Learned tokens from 0 message(s) (0 message(s) examined) (This is using SA 3.3.2 on a CentOS 5.10 box.) I tried placing other spam in here and it learned those fine, so clearly something about these two messages is confusing sa-learn. Anyone have an idea why sa-learn is refusing to even examine these messages? (Note that the messages are out of order; the first one is newer than the second. The older one scored Bayes_50, the newer one scored Bayes_00.) Any thoughts are greatly appreciated, I don't know why sa-learn won't even touch these... and that may explain why they continue to have low scores! --- Amir
Re: Increase in Image Spam
On 2014-02-20 19:34, Axb wrote:
> well, not huge...let me brag :)
>
> sa-learn --dump magic
> 0.000 0 3 0 non-token data: bayes db version
> 0.000 0 17663091 0 non-token data: nspam
> 0.000 0 6768342 0 non-token data: nham

how many Raspberry Pis are needed in a cluster setup to handle this? :=)

/me hides
Re: Increase in Image Spam
On 02/20/2014 06:44 PM, Amir Caspi wrote:
> On Feb 20, 2014, at 10:34 AM, Axb wrote:
>> I hope you're running SA 3.4 so:
>
> I am still on 3.3.2 because nobody has yet packaged 3.4 for CentOS 5.x, from what I can tell. I have the package from the rpmforge-extras repo, and 3.3.2 is still the most current version there (and on Atomic and AtRPMs). I'm not sure who is responsible for updating the packages, but I'll probably have to wait a while until they get 3.4 uploaded there.
>
>> Assuming you can check maillogs and can either detect some spammed unknown user patterns or have a dedicated trap domain to spare, I'd accept that mail and write some header rules to score the trap rcpt/domain REAL high and use a rule like
>>
>> tflags RULENAME autolearn_force
>
> I'm not entirely sure what you mean here. Are you saying to use a honeypot/spamtrap to feed the Bayes DB?

yep, exactly.

> My problem is not that my Bayes DB doesn't have enough spam in it, it's that these particular FNs are scoring 00. Let me note that the Bayes DBs are per-user, not per-domain. Here's the magic output from my Bayes DB:

Personally I wouldn't use a per-user Bayes DB but a site-wide one, so all users get the benefit of your trapped data/learnt spam. I'd bet you'd see a major improvement in spam detection and no FPs.

> I don't think this counts as a "small" DB, does it?

well, not huge...let me brag :)

sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 17663091 0 non-token data: nspam
0.000 0 6768342 0 non-token data: nham

> Bayes is set to autolearn, and I manually run sa-learn about once a week on my spam folder (to learn the FNs, plus lower-scoring spam that was not autolearned). MANY such image spams are caught properly, including by Bayes; the problem is that some of them, somehow, manage to slip through and score very low (00 or 20). I just have no idea how that is happening (which is why I should start enabling token output in the headers and look), but that's why I was thinking of scoring AC_SPAMMY_URI_PATTERNS very high if Bayes is scoring very low, although I guess that kind of defeats the purpose of Bayes and introduces the risk of FPs.

It seems obvious that learning manually a week later isn't doing the trick. IMO you need a better way to learn "in the flow", such as an IMAP folder to drop FNs into and a script that learns spam from there every hour, for example...

Axb
Re: Increase in Image Spam
Amir Caspi wrote: > Bayes is set to autolearn, and I manually run sa-learn about once a week on > my spam folder (to learn the FNs, plus lower-scoring spam that was not > autolearned). Try setting up a cron job to run this daily or even as often as hourly. The faster you get feedback into the system the less likely it is you'll end up with strange results. Have you tried learning one specific FN, then reprocessing that message to see what Bayes score it gets? IME it will usually shift from BAYES_00 to at least BAYES_40 in most cases, even with a large sitewide DB with far more tokens than the usual per-user DB. -kgd
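The daily or hourly feedback loop suggested above could be scripted along these lines. This is only a minimal Python sketch: `--spam`, `--ham` and `--mbox` are existing sa-learn options, while the function names, the mailbox path, the script path and the schedule are placeholders chosen for illustration.

```python
import subprocess


def build_learn_command(mbox_path: str, kind: str = "spam") -> list:
    """Build the sa-learn command line for one mbox file.

    kind is "spam" or "ham"; anything else is rejected so that a typo
    cannot train the wrong class.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", "--mbox", mbox_path]


def learn(mbox_path: str, kind: str = "spam") -> str:
    """Run sa-learn on the mailbox and return its summary output."""
    result = subprocess.run(
        build_learn_command(mbox_path, kind),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Example crontab entry (hypothetical paths), hourly at minute 5:
#   5 * * * * python3 /usr/local/bin/learn_fn.py
# where learn_fn.py calls learn("/home/user/mail/missed-spam").
```

sa-learn reports a summary of the form "Learned tokens from N message(s) (M message(s) examined)", so logging the returned string makes it easy to notice runs where nothing was examined.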
Re: Increase in Image Spam
On 2014-02-20 18:06, Amir Caspi wrote:
> for whatever reason, many of the FNs I've been getting lately are passing because they hit BAYES_00, even though they are matching AC_SPAMMY_URI_PATTERNS. I need to enable bayes tokens in the headers so I can see why these are considered so hammy when I know for sure they're not...

meta AC_URI_BAYES_HAM (AC_SPAMMY_URI_PATTERNS && BAYES_00)

score with 5 ?

> But, I would love if there were a way to ignore the bayes score if AC_SPAMMY_URI_PATTERNS matches.

see above, don't count on scores, make rules to add scores, for the spam that is really spam

> I know this is rather silly -- the whole point of Bayes is to help determine if an email is spam or ham regardless of the other rules -- but I'm just flummoxed by having these obviously-spammy emails being treated as ham.

you should really just train bayes more then, spammers will always lose if bayes is well trained

> Should I create a rule that adds extra points if AC_SPAMMY_URI_PATTERNS hits AND a low Bayes score is found?

yep, as i showed above

> Or should I just make AC_SPAMMY_URI_PATTERNS a poison pill, since I've never gotten an FP out of it?

this will work as well, but if bayes is trained to bayes_60 or higher it does not really need more help on bayes scoring

> Not sure what else to do about these Bayes-killing spams (besides wiping my entire Bayes DB and starting over).

this will be counter productive :=)

> Thoughts?

samples somewhere ?
Re: Increase in Image Spam
On Feb 20, 2014, at 10:34 AM, Axb wrote:
> I hope you're running SA 3.4 so:

I am still on 3.3.2 because nobody has yet packaged 3.4 for CentOS 5.x, from what I can tell. I have the package from the rpmforge-extras repo, and 3.3.2 is still the most current version there (and on Atomic and AtRPMs). I'm not sure who is responsible for updating the packages, but I'll probably have to wait a while until they get 3.4 uploaded there.

> Assuming you can check maillogs and can either detect some spammed unknown
> user patterns or have a dedicated trap domain to spare, I'd accept that mail
> and write some header rules to score the trap rcpt/domain REAL high and use a
> rule like
>
> tflags RULENAME autolearn_force

I'm not entirely sure what you mean here. Are you saying to use a honeypot/spamtrap to feed the Bayes DB? My problem is not that my Bayes DB doesn't have enough spam in it, it's that these particular FNs are scoring 00. Let me note that the Bayes DBs are per-user, not per-domain. Here's the magic output from my Bayes DB:

0.000 0 3 0 non-token data: bayes db version
0.000 0 239650 0 non-token data: nspam
0.000 0 85695 0 non-token data: nham
0.000 0 145773 0 non-token data: ntokens
0.000 0 1387110367 0 non-token data: oldest atime
0.000 0 1392917375 0 non-token data: newest atime
0.000 0 1392886526 0 non-token data: last journal sync atime
0.000 0 1392637273 0 non-token data: last expiry atime
0.000 0 5529600 0 non-token data: last expire atime delta
0.000 0 9005 0 non-token data: last expire reduction count

I don't think this counts as a "small" DB, does it?

Bayes is set to autolearn, and I manually run sa-learn about once a week on my spam folder (to learn the FNs, plus lower-scoring spam that was not autolearned). MANY such image spams are caught properly, including by Bayes; the problem is that some of them, somehow, manage to slip through and score very low (00 or 20). I just have no idea how that is happening (which is why I should start enabling token output in the headers and look), but that's why I was thinking of scoring AC_SPAMMY_URI_PATTERNS very high if Bayes is scoring very low, although I guess that kind of defeats the purpose of Bayes and introduces the risk of FPs.

-- Amir
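The `--dump magic` counters in this message are easier to read once the atime values are converted to dates. Below is a minimal Python sketch, assuming the whitespace-separated layout quoted here (four numeric columns followed by `non-token data: <label>`, with the value in the third column); the function names are illustrative only.

```python
from datetime import datetime, timezone


def parse_dump_magic(text: str) -> dict:
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in text.splitlines():
        head, sep, label = line.partition("non-token data:")
        fields = head.split()
        if not sep or len(fields) < 3:
            continue  # skip token lines and anything malformed
        values[label.strip()] = int(fields[2])
    return values


def to_utc(epoch: int) -> str:
    """Format a Unix timestamp as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat()


# Values copied from the message above.
SAMPLE = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 239650 0 non-token data: nspam
0.000 0 85695 0 non-token data: nham
0.000 0 145773 0 non-token data: ntokens
0.000 0 1387110367 0 non-token data: oldest atime
0.000 0 1392917375 0 non-token data: newest atime
"""

stats = parse_dump_magic(SAMPLE)
print("spam:ham ratio =", round(stats["nspam"] / stats["nham"], 2))
print("oldest token access:", to_utc(stats["oldest atime"]))
print("newest token access:", to_utc(stats["newest atime"]))
```

For these values the oldest token access falls on 2013-12-15 UTC and the newest on 2014-02-20 UTC, with roughly 2.8 learned spam messages per learned ham message.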
Re: Increase in Image Spam
On 02/20/2014 06:22 PM, Amir Caspi wrote:
> On Feb 20, 2014, at 10:15 AM, Axb wrote:
>> What kind of traffic are you dealing with? personal, corporate? ISPish?
>> How many domains/users/msgs/day?
>
> This is mostly personal email with a little bit of corporate. In this instance, it is for a single domain with 3 users and approximately 50-100 total legitimate messages per day (but HUNDREDS of spams per day, most of which are properly classified; I am seeing only a few [<10] FNs per day, although those FNs are, as I described, getting Bayes_00... they are almost always image spam with not much text.) I do have a number of other domains but I don't monitor the spam quality on those actively (and I haven't received complaints).

In your case this is what I'd do. I hope you're running SA 3.4 so:

Assuming you can check maillogs and can either detect some spammed unknown user patterns or have a dedicated trap domain to spare, I'd accept that mail and write some header rules to score the trap rcpt/domain REAL high and use a rule like

tflags RULENAME autolearn_force

obviously you'll need

bayes_auto_learn 1

That would help feed your small Bayes DB pretty fast and help detect all kinds of crap.

h2h
Re: Increase in Image Spam
On Feb 20, 2014, at 10:15 AM, Axb wrote: > What kind of traffic are you dealing with? personal, corporate? ISPish? > How many domains/users/msgs/day? This is mostly personal email with a little bit of corporate. In this instance, it is for a single domain with 3 users and approximately 50-100 total legitimate messages per day (but HUNDREDS of spams per day, most of which are properly classified; I am seeing only a few [<10] FNs per day, although those FNs are, as I described, getting Bayes_00... they are almost always image spam with not much text.) I do have a number of other domains but I don't monitor the spam quality on those actively (and I haven't received complaints). Thanks. --- Amir
Re: Increase in Image Spam
On 02/20/2014 06:06 PM, Amir Caspi wrote:
> Hi all,
>
> Following some off-list discussions with Kevin, John, et al., I had a question that was suggested I bring up on-list, so here it is:
>
> For whatever reason, many of the FNs I've been getting lately are passing because they hit BAYES_00, even though they are matching AC_SPAMMY_URI_PATTERNS. I need to enable bayes tokens in the headers so I can see why these are considered so hammy when I know for sure they're not...
>
> But, I would love if there were a way to ignore the bayes score if AC_SPAMMY_URI_PATTERNS matches. I know this is rather silly -- the whole point of Bayes is to help determine if an email is spam or ham regardless of the other rules -- but I'm just flummoxed by having these obviously-spammy emails being treated as ham.
>
> Should I create a rule that adds extra points if AC_SPAMMY_URI_PATTERNS hits AND a low Bayes score is found? Or should I just make AC_SPAMMY_URI_PATTERNS a poison pill, since I've never gotten an FP out of it?
>
> Not sure what else to do about these Bayes-killing spams (besides wiping my entire Bayes DB and starting over).
>
> Thoughts?

Amir,

What kind of traffic are you dealing with? personal, corporate? ISPish? How many domains/users/msgs/day?

There are a number of options depending on the amount of traffic you handle.
Re: Increase in Image Spam
Hi all, Following some off-list discussions with Kevin, John, et al., I had a question that was suggested I bring up on-list, so here it is: For whatever reason, many of the FNs I've been getting lately are passing because they hit BAYES_00, even though they are matching AC_SPAMMY_URI_PATTERNS. I need to enable bayes tokens in the headers so I can see why these are considered so hammy when I know for sure they're not... But, I would love if there were a way to ignore the bayes score if AC_SPAMMY_URI_PATTERNS matches. I know this is rather silly -- the whole point of Bayes is to help determine if an email is spam or ham regardless of the other rules -- but I'm just flummoxed by having these obviously-spammy emails being treated as ham. Should I create a rule that adds extra points if AC_SPAMMY_URI_PATTERNS hits AND a low Bayes score is found? Or should I just make AC_SPAMMY_URI_PATTERNS a poison pill, since I've never gotten an FP out of it? Not sure what else to do about these Bayes-killing spams (besides wiping my entire Bayes DB and starting over). Thoughts? Thanks. --- Amir
Re: Increase in Image Spam
On 2014-02-11 20:59, RW wrote:
> Actually I find BAYES_99 to be so reliable that I'd be happy to score it above 5.0. Others have made similar comments too.

there are a number of ways to punish spf pass domains for spamming :)

blacklist_from *@foo.example.org

and for bayes one could make another meta like:

meta NOT_BAYES_HAM_SPF_PASS (!BAYES_00 && SPF_PASS)

or simply reject the sender domain in the mta
Re: Increase in Image Spam
On Tue, 11 Feb 2014 20:22:00 +0100 Benny Pedersen wrote:
> On 2014-02-11 18:25, Andy Jezierski wrote:
> > They don't really hit on any rules
> >
> > X-Spam-Status: No, score=3.5 required=5.0
> > tests=BAYES_99,HTML_MESSAGE,
> > SPF_HELO_PASS,SPF_PASS autolearn=no autolearn_force=no
> > version=3.4.0-rc5
>
> bayes is seeing it as spam, so it might be in vain :)
>
> well if bayes is well trained you can add more meta score to that
> hit, but also maybe meta it with not user in spf whitelist or
> something ?

Actually I find BAYES_99 to be so reliable that I'd be happy to score it above 5.0. Others have made similar comments too.
Re: Increase in Image Spam
On 2014-02-11 18:25, Andy Jezierski wrote:
> They don't really hit on any rules
>
> X-Spam-Status: No, score=3.5 required=5.0 tests=BAYES_99,HTML_MESSAGE,
>   SPF_HELO_PASS,SPF_PASS autolearn=no autolearn_force=no version=3.4.0-rc5

bayes is seeing it as spam, so it might be in vain :)

well if bayes is well trained you can add more meta score to that hit, but also maybe meta it with not user in spf whitelist or something ?

eg if an spf pass domain is spamming, remove it from local.cf as whitelisted for that envelope sender, not From: header

meta UNTRUSTED_SPF_PASS (SPF_PASS && !USER_IN_SPF_WHITELIST)

score based on that meta

for that distinction to be useful, add

whitelist_from_spf *@foo.example.com

to local.cf for sender domains that are not spamming

the same meta can be made with dkim
Re: Increase in Image Spam
On 2/11/2014 2:02 PM, John Hardin wrote:
> On Tue, 11 Feb 2014, Amir Caspi wrote:
>> I could release the rules publicly but that may end up backfiring, per above. John, Kevin, what do you guys think?
>
> Spammers can install SpamAssassin as easily as anyone else, that's a known risk. Any rules we provide they can potentially test against their spams to minimize score. How much they actually *do* this I can't say. We could try it with one of your rules, and if it suddenly stops hitting then the spammers are reacting. I think it has value, even if they do react.

I agree with John's assessment.
Re: Increase in Image Spam
On Tue, 11 Feb 2014, Amir Caspi wrote:
> I could release the rules publicly but that may end up backfiring, per above. John, Kevin, what do you guys think?

Spammers can install SpamAssassin as easily as anyone else, that's a known risk. Any rules we provide they can potentially test against their spams to minimize score. How much they actually *do* this I can't say. We could try it with one of your rules, and if it suddenly stops hitting then the spammers are reacting. I think it has value, even if they do react.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Windows Genuine Advantage (WGA) means that now you use your computer at the sufferance of Microsoft Corporation. They can kill it remotely without your consent at any time for any reason; it also shuts down in sympathy when the servers at Microsoft crash.
---
Tomorrow: Abraham Lincoln's and Charles Darwin's 205th Birthdays
Re: Increase in Image Spam
On Feb 11, 2014, at 10:25 AM, Andy Jezierski wrote: > They don't really hit on any rules A number of image spams have certain template formats and I've written custom rules to catch many... however, I've been hesitant to release those rules publicly since spammers could just change their templates easily to circumvent this. (Most image spams for me hit moderate or very low Bayes scores, sometimes Bayes_00, presumably due to the low amount of spammy tokens and large amount of innocuous/hammy tokens...) I could release the rules publicly but that may end up backfiring, per above. John, Kevin, what do you guys think? --- Amir
Increase in Image Spam
I've been seeing a pretty big increase in image spam over the last month or so. I remember using FuzzyOCR years ago when image spam was a much bigger problem. Since FuzzyOCR hasn't been maintained in several years, is there an alternative that would work? Or is there another way to try and catch them?

They don't really hit on any rules

X-Spam-Status: No, score=3.5 required=5.0 tests=BAYES_99,HTML_MESSAGE,
  SPF_HELO_PASS,SPF_PASS autolearn=no autolearn_force=no version=3.4.0-rc5

Thanks
Andy
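The header in this message shows a score of 3.5 against a threshold of 5.0 even though BAYES_99 hit. When reviewing many missed spams, those fields can be pulled out programmatically. The following is a minimal Python sketch; the regular expressions assume the field layout shown in this message, and the function name `parse_spam_status` is illustrative only.

```python
import re


def parse_spam_status(value: str) -> dict:
    """Extract score, required threshold and rule names from an
    X-Spam-Status header value (the text after the colon)."""
    flat = " ".join(value.split())  # unfold continuation lines
    score = re.search(r"\bscore=(-?\d+(?:\.\d+)?)", flat)
    required = re.search(r"\brequired=(-?\d+(?:\.\d+)?)", flat)
    # The tests list runs until the next "key=" field or the end of the value.
    tests = re.search(r"\btests=(.*?)(?= \w+=|$)", flat)
    names = tests.group(1).replace(" ", "").split(",") if tests else []
    return {
        "score": float(score.group(1)) if score else None,
        "required": float(required.group(1)) if required else None,
        "tests": [name for name in names if name],
    }


status = parse_spam_status(
    "No, score=3.5 required=5.0 tests=BAYES_99,HTML_MESSAGE,\n"
    " SPF_HELO_PASS,SPF_PASS autolearn=no autolearn_force=no version=3.4.0-rc5"
)
print(status["tests"], round(status["required"] - status["score"], 1))
```

For the header above, the gap between the score and the threshold is 1.5 points.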
Re: Increase in image spam
On 6-Feb-2007, at 09:30, Sujit Choudhury wrote:
> Lately there has been an increase in image spam. We are using imageinfo.cf with the ImageInfo plugin. However, this is not making a lot of difference. We are also using virtually all the SARE rules plus using sa-update and restarting spamd every day.

I added ImageInfo yesterday and it's hit only on very high scoring spams so far.

> We are unwilling to use FuzzyOCR, which I understand is very CPU intensive. Does anybody have any suggestions?

Not really. If you aren't able to eliminate them without looking at the image, your choices are pretty limited: live with them or run FuzzyOCR. You might try running it over the most egregious accounts and see how much of a hit it really imposes.

> By the way, we use RBL to get rid of 80% spam at SMTP time, and spamd is run only on the rest of the mail that is coming through.

Have you looked into greylisting?

--
Do you believe that there's someone up above, and does he have a timetable directing acts of love?