Re: Dropping mail
Dianne Skoll wrote: On Fri, 27 Apr 2018 14:39:43 -0500 (CDT) David B Funk wrote: [snip] Define two classes of recipients: class A == all users who want everything; class B == all users who want "standard" filtering. This works if you have a limited number of classes, but in some cases users can make their own rules and settings, so the number of classes can be the same as the number of RCPTs. --- Except users who have their own rules are not likely applying them in the context of the initial choice of whether or not to accept the email onto the server. I.e. they'll run some anti-spam filter in their "account" context as a normal user. The users who want "standard" filtering may have it done such that the email is never accepted onto their email server. I.e. it "should" never be the case that user-defined filters are run in the MTA's initial receive context, as the MTA just received (or is in the process of receiving) an email coming in on a privileged port (like port 25), which would imply a privileged context (most likely root). Even in the two-class case, there's still a delay for the subsequent class(es). --- Delays are not the same as dropped email. People use grey-listing, which often causes some delay, but in this case, I've seen examples of people whose mail provider was inspecting+filtering emails for spam+malware also having problems with delivery time (60-90 minutes after the fact). So it is already the case that mail providers who do filtering on the mail server are sometimes slow to pass on the email to their users (depending on their volume). linda
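P.S. To make the "two classes" idea concrete: in Postfix (one concrete MTA; the addresses, map path, and RBL zone are just placeholders), the per-RCPT decision at SMTP time could look roughly like this:

    # main.cf
    smtpd_restriction_classes = class_a, class_b
    class_a = permit
    class_b = reject_rbl_client zen.spamhaus.org, permit
    smtpd_recipient_restrictions =
        check_recipient_access hash:/etc/postfix/recipient_class, permit

    # /etc/postfix/recipient_class (one class per recipient)
    alice@example.com    class_a
    bob@example.com      class_b

With true per-user rules instead of a few fixed classes, that table degenerates to one class per RCPT -- which is exactly the scaling problem described above.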
Re: how to enable autolearn?
Marc Stürmer wrote: On 2017-01-09 22:30, L A Walsh wrote: I have: bayes_auto_learn_threshold_nonspam -5.0 bayes_auto_learn_threshold_spam 10.0 In order for autolearn to work you need at least 200 trained messages in the ham and spam category. If the filter doesn't know enough mails yet it will state it in the log file. Have you trained your Bayes filter accordingly or just enabled it and expect it to start autolearning out of the box? My corpus is regularly pruned, but I still have daily junk-mail logs going back to 2014 (~776 files over the 3 years, where each file contains a day's spam). I did have junk going back much farther until I decided it was a bit too much and a bit too dated. Also take a look at: https://wiki.apache.org/spamassassin/AutolearningNotWorking I'm not terribly worried, since every night all the junk messages get fed in as spam. Anything I catch as non-spam gets tossed in the auto-despam folder, but those are only a few a week. Quote: "Finally, SpamAssassin requires at least 3 points from the header and 3 points from the body, to auto-learn as spam. If either section contributes fewer points, the message will not be auto-learned." I guess the latter might be just the case with your installation. --- ??? How so? Though the past few days I've been getting many spam mails that are mostly header, with little to no body, plus an attachment ...
Re: how to enable autolearn?
John Hardin wrote: On Mon, 9 Jan 2017, L A Walsh wrote: I have: bayes_auto_learn_threshold_nonspam -5.0 bayes_auto_learn_threshold_spam 10.0 in my user_prefs. When I get a message though, I see autolearn being set to 'no': X-Spam-Status: Yes, score=18.7 req=4.8..autolearn=no autolearn_force=no Shouldn't a score of 18.7 trigger an autolearn? Not all rules contribute to the score used for the autolearn decision. Particularly, the BAYES rules don't contribute to the autolearning decision in order to avoid positive feedback loops. That's why my "bayes_auto_learn" thresholds were fairly high. So why is it called bayes_auto_learn_threshold if it isn't used for auto-learning? Isn't that a bit confusing?
Re: How to report spam to mailspike
Dave Warren wrote: On 2014-08-29 02:38, Marcin Mirosław wrote: So what should I do in your opinion? I'm getting spam to my private spamtrap so I can't fill fields about company - it doesn't matter where I'm hired for reporting spam. What if I would be unemployed? Then I would have to lie about company? IMHO it is the way to hinder sending complaints from users. If you're not willing --- I think perception may be "am not able"... ? to provide the information they request, and they won't accept an inquiry without it, then you're left with a different choice: 1) Do nothing, 2) Cease using the service. From their perspective, either the policy will ... --- If they really mean company then it helps them target companies for their own advertising. If I'm acting on my own behalf, I'd put "Personal" or "None" or "N/A" into a form, and if it's not accepted, oh well. --- Ditto on this... Company "Self" has been in business for decades! ;-) "They" are definitely a "Service provider"... (think of all the things 'self' does for you!) ;-) Corporation was a way of "embodying" a business practice to give it human rights... but you are already "embodied", thus incorporated (no offense to the non-corporeal beings reading this list). I'm sure you govern yourself as well if you want to get technical, so if they want to be technical, so can others... Then again, are they worth the bother?
Re: Advice sought on how to convince irresponsible Megapath ISP.
Karsten Bräckelmann wrote: Similarly, your scripts do not reject messages, but choose not to fetch them. === No... fetchmail fetches them; "sendmail" rejects them because they don't have a resolvable domain. My sorting and spamassassin scripts get called after the email makes it through sendmail. My scripts don't see the email. Pragmatic solution: If you insist on your scripts to not fetch those spam messages (which have been accepted by the MX, mind you), automate the "manual download and delete stage", which frankly only exists due to your choice of not downloading them in the first place. Make your scripts delete, instead of skipping over them. 'fetchmail', that I know of, isn't able to tell if a sending domain is invalid until it has fetched the email. fetchmail tries to send the email to me via sendmail, which doesn't accept the email because it is invalid. Unfortunately, my ISP doesn't use sendmail or it would reject such emails by default. Be liberal in what you accept, strict in what you send. In particular, later stages simply must not be less liberal than early stages. In this case, I don't even want the invalid email passed on to me. I don't want to accept spam. The first defense is to have the MX reject non-conforming email. Your MX has accepted the message. My ISP's MX has accepted it, because it doesn't do domain checking. My machine's MX rejects it, so fetchmail keeps trying to deliver it. While I *could* figure out how to hack sendmail to not reject the message, my preference would be to get the ISP to act responsibly and reject emails without a valid return domain name. It's standard in sendmail; rejection of such email is called for in the RFCs. The choice to not follow the RFCs allows spam that would normally be rejected through to my system, which does follow the standards and rejects it -- so it stays in the "download queue" for my domain. At that point, there is absolutely no way to not accept it and reject it later. You can classify, which you use SA for (I guess, given you're posting here). You can filter or even delete based on classification, or other criteria. The MX shouldn't accept it, based on RFC standards. When I asked for it to be blocked, I was first asked for the name of the "offending domain" and told I could blacklist the domain by adding it to a list with their web-client. I asked for a scriptable way to do this after a domain lookup; they said they no longer offer scripted solutions as the ISP I signed up with (who they bought) did. The only response my ISP will give is to turn on their spam filtering. I tried that. In about a 2-hour time frame, over 400 messages were blocked as spam. Of those, fewer than 10 were actually spam; the rest were from various lists. So having them censoring my incoming mail isn't gonna work, but neither will they reject the obviously invalid-domain email. I can't believe that they insist on forwarding SPAM to their users even though they know it is invalid and is spam. There is no censoring. When I complained about the problem I found that "recommended filter rules" had been activated on my account. In the couple of days they were active, about 80% of the messages they caught were not spam -- and some of the bad domains still got passed through. There is no forwarding. It comes in their MX, and is forwarded to their users. Any ideas on how to get a cheapo-doesn't-want-to-support-anything ISP to start blocking all the garbage they pass on? Change ISP. You decided for them to run your MX.
I didn't decide for them; I inherited them when they bought out the competition to supply lower quality service for the same price. It is your choice to aim for a cheapo service (your words). It wasn't when I signed up. It cost $100 extra/month; now only $30 extra/month since I don't host the domain with them. If you're unhappy with the service, take your business elsewhere. Better service doesn't necessarily mean more expensive, but you might need to shell out a few bucks for the service you want. I already am... my ISP (cable company) doesn't have the services I want for mail hosting. I went to another company for that, which was bought out some time ago, with the new company dropping quality as time goes on. In this case, I wanted to try to push back against them accepting the illegal (not to spec) spam and forwarding it to their customers in the first place. There are many "compromised" solutions that are available. Certainly such choices are not my first, which was why I posted here to see if anyone else had any experience with getting an irresponsible ISP to reject non-compliant email, and barring that, maybe getting offered better choices from the experience of the people on this list. Cheers! '^/
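P.S. For anyone else pushing back on a provider: the check I'm asking for is sendmail's default behavior; a provider has to go out of its way to turn it off with something like this in its sendmail.mc (shown only to name the knob; I'm assuming a stock sendmail build):

    dnl FEATURE(`accept_unresolvable_domains')dnl

i.e. removing (or never adding) that FEATURE line is all it takes for an MX to reject envelope senders whose domain doesn't resolve.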
Re: SA 3.3.2 buggie? -- message that DB file doesn't exist -- but strace shows successful lock and open!
Michael Scheidell wrote: On 1/16/12 9:36 AM, Linda Walsh wrote: This is not a permission problem -- Message I get: have you tried to upgrade to the released version? 3.3.2? 3.0.2 was obsolete 6 years ago. --- Well, I could pretend like you wouldn't have guessed it was a typo and tell you that you were right, and that after installing 3.3.2, I had the exact same bug... but, I'll just mention that it is 3.3.2... sa-learn --version SpamAssassin version 3.3.2 That is 'still' having this problem (though I will note that most of the references to this error message date back to the early 3.0 series... so this bug has been out there for over 6 years, according to your timeline). That's a long time to ignore a widely reported bug (judging by googling on it) -- the only solution ever offered was to check permissions, which I verified in the trace as not being the problem; the message being issued by SA is bogus.
SA 3.0.2 buggie? -- message that DB file doesn't exist -- but strace shows successful lock and open!
This is not a permission problem -- Message I get: bayes: cannot open bayes databases /home/lw_spam/.spamassassin/bayes_* R/O: tie failed: bayes: cannot open bayes databases /home/lw_spam/.spamassassin/bayes_* R/W: tie failed: No such file or directory --- Except I followed it through using strace. Both are being opened, and the 2nd is even successfully being LOCKED: Jan 16 06:17:34.806 [20156] dbg: locker: safe_lock: trying to get lock on /home/lw_spam/.spamassassin/bayes with 0 retries link("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156", "/home/lw_spam/.spamassassin/bayes.lock") = 0 Jan 16 06:17:34.806 [20156] dbg: locker: safe_lock: link to /home/lw_spam/.spamassassin/bayes.lock: link ok before it is opened... then SA turns around and claims it can't find them... So why is SA opening the files, but then writing out a completely BOGUS and false message that it couldn't open them or even find them?!?!... Whatever the problem is -- a better error message that isn't LYING would be a good thing at this point, since in searching on the web, I see a lot of people getting this -- and it's often blamed on their permissions... but now, everyone should know that permissions are not it... the message is completely bogus... it can open them just fine -- something else may be wrong, but the message is very misleading. More complete log follows... (deleted all the lines that had stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 in them -- they were 2/3rds of the debug messages) -l Jan 16 06:17:34.798 [20156] dbg: bayes: tie-ing to DB file R/O /home/lw_spam/.spamassassin/bayes_toks stat("/home/lw_spam/.spamassassin/bayes_toks", {st_mode=S_IFREG|0777, st_size=5177344, ...}) = 0 open("/home/lw_spam/.spamassassin/bayes_toks", O_RDONLY) = 3 bayes: cannot open bayes databases /home/lw_spam/.spamassassin/bayes_* R/O: tie failed: Jan 16 06:17:34.799 [20156] dbg: bayes: untie-ing DB file toks Jan 16 06:17:34.799 [20156] dbg: config: score set 1 chosen.
Jan 16 06:17:34.800 [20156] dbg: sa-learn: spamtest initialized Jan 16 06:17:34.800 [20156] dbg: learn: initializing learner Jan 16 06:17:34.800 [20156] dbg: plugin: Mail::SpamAssassin::Plugin::Bayes=HASH(0x286d1f8) implements 'learner_sync', priority 0 Jan 16 06:17:34.801 [20156] dbg: bayes: bayes journal sync starting stat("/home/lw_spam/.spamassassin", {st_mode=S_IFDIR|S_ISGID|0777, st_size=4096, ...}) = 0 stat("/home/lw_spam/.spamassassin", {st_mode=S_IFDIR|S_ISGID|0777, st_size=4096, ...}) = 0 stat("/home/lw_spam/.spamassassin", {st_mode=S_IFDIR|S_ISGID|0777, st_size=4096, ...}) = 0 stat("/home/lw_spam/.spamassassin/bayes_journal", 0xe3e138) = -1 ENOENT (No such file or directory) Jan 16 06:17:34.801 [20156] dbg: bayes: bayes journal sync completed Jan 16 06:17:34.802 [20156] dbg: plugin: Mail::SpamAssassin::Plugin::Bayes=HASH(0x286d1f8) implements 'learner_expire_old_training', priority 0 Jan 16 06:17:34.802 [20156] dbg: bayes: expiry starting stat("/home/lw_spam/.spamassassin/bayes_toks", {st_mode=S_IFREG|0777, st_size=5177344, ...}) = 0 stat("/home/lw_spam/.spamassassin", {st_mode=S_IFDIR|S_ISGID|0777, st_size=4096, ...}) = 0 Jan 16 06:17:34.803 [20156] dbg: locker: mode is 384 stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=834, ...}) = 0 open("/etc/resolv.conf", O_RDONLY) = 3 open("/etc/host.conf", O_RDONLY)= 3 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 3 open("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 Jan 16 06:17:34.805 [20156] dbg: locker: safe_lock: created /home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156 Jan 16 06:17:34.806 [20156] dbg: locker: safe_lock: trying to get lock on /home/lw_spam/.spamassassin/bayes with 0 retries link("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156", "/home/lw_spam/.spamassassin/bayes.lock") = 0 Jan 16 06:17:34.806 [20156] dbg: locker: safe_lock: link to /home/lw_spam/.spamassassin/bayes.lock: link ok unlink("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156") = 0 lstat("/home/lw_spam/.spamassassin/bayes.lock", {st_mode=S_IFREG|0660, st_size=26, ...}) = 0 Jan 16 06:17:34.807 [20156] dbg: bayes: tie-ing to DB file R/W /home/lw_spam/.spamassassin/bayes_toks open("/home/lw_spam/.spamassassin/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3 stat("/home/lw_spam/.spamassassin/__db.bayes_toks", 0xe3e138) = -1 ENOENT (No such file or directory) stat("/home/lw_spam/.spamassassin/bayes_toks", {st_mode=S_IFREG|0777, st_size=5177344, ...}) = 0 open("/home/lw_spam/.spamassassin/bayes_toks", O_RDWR) = 3 Jan 16 06:17:34.808 [20156] dbg: bayes: untie-ing DB file toks open("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156", O_WRONLY|O_CREAT|O_EXCL, 0700) = 3 unlink("/home/lw_spam/.spamassassin/bayes.lock.Ishtar.sc.tlinx.org.20156") = 0 lstat("/home/lw_spam/.spamassassin/bayes.lock", {st_mode
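P.S. A way to take SA out of the picture and test the tie directly (a sketch; the path and the $DB_HASH type are assumptions based on the debug output above):

    perl -MDB_File -MFcntl -e \
      'tie(my %t, "DB_File", "$ENV{HOME}/.spamassassin/bayes_toks",
           O_RDONLY, 0600, $DB_HASH) or die "tie failed: $!\n";
       print scalar(keys %t), " entries\n"'

If that dies the same way -- with an empty $! -- it would point at the Berkeley DB library rejecting the file's internal format without setting errno, which would at least explain the blank reason after "tie failed:".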
BUGs? (Re: Upgraded to new spamassassin (3.3.2), and now it won't work (no rules.....ran sa-update, nada...)
Linda Walsh wrote: Sorry, included that in my subject I did run sa-update; all it says (put it in verbose mode) is that the rules are up to date. Initially it did download the rules into /var/lib/spamassassin/. Those files are still there, but spamd is, apparently, not seeing them. --- I finally solved this by copying all the rules from /etc/mail/spamassassin into /usr/share/spamassassin... and then it was all happy. Of course this would seem to break multiple rules about where to keep the various rule sets... but hey, maybe spamd didn't get the memo about where it was supposed to look... I dunno... It IS looking in /var/lib/spamassassin, but that was failing, because it couldn't find 'Mail::SpamAssassin/CompiledRegexps/body_0.pm'. It's the file that includes all the compiled expressions. The path structure under /var/lib/spamassassin is a bit confused/confusing. That's what caused that error. I.e. the dir struct under /var/lib/spamassassin looks like: Ishtar:/var/lib/spamassassin> tree -FNhsR --filelimit 7 . ├── [ 71] 3.003002/ │ ├── [4.0K] updates_spamassassin_org/ [61 entries exceeds filelimit, not opening dir] │ └── [2.5K] updates_spamassassin_org.cf ├── [ 18] compiled/ │ └── [ 21] 5.012/ │ └── [ 50] 3.003002/ ## under the 3.003002 dir under compiled is where it gets interesting: ## two trees for Mail/SA/CompiledRegexps, one rooted here, ## but the other 'down' a level under 'auto' ## (where the real binaries are). │ ├── [ 17] auto/ │ │ └── [ 25] Mail/ │ │ └── [ 28] SpamAssassin/ │ │ └── [ 35] CompiledRegexps/ │ │ ├── [ 54] body_0/ │ │ │ ├── [ 0] body_0.bs │ │ │ └── [1.3M] body_0.so* │ │ └── [ 51K] body_0.pm #was missing │ ├── [237K] bases_body_0.pl │ └── [ 25] Mail/ │ └── [ 28] SpamAssassin/ │ └── [ 22] CompiledRegexps/ │ └── [ 51K] body_0.pm #copied to above ├── [2.6K] user_prefs* └── [2.6K] user_prefs.template* 13 directories, 8 files As for why it didn't find rules in /etc/mail/SA (it DID read /etc/mail/SA, just didn't regard anything there as a rule)... so duping those files into /usr/share/SA made them magically become 'rules'. I assume, of course, this is correct-by-design behavior? ;-) (*cough*)
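P.S. Spelled out, the workaround that made it happy (paths per the tree listing above; the 5.012/3.003002 components will vary with your perl/SA versions):

    cd /var/lib/spamassassin/compiled/5.012/3.003002
    cp Mail/SpamAssassin/CompiledRegexps/body_0.pm \
       auto/Mail/SpamAssassin/CompiledRegexps/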
Re: Upgraded to new spamassassin (3.3.2), and now it won't work (no rules.....ran sa-update, nada...
Sorry, included that in my subject I did run sa-update; all it says (put it in verbose mode) is that the rules are up to date. Initially it did download the rules into /var/lib/spamassassin/. Those files are still there, but spamd is, apparently, not seeing them. Martin Gregorie wrote: Run sa-update. SA packages from 3.x onwards don't include the rule set, to avoid installing stale rules. A good install will have added /etc/cron.d/sa-update to your system. It runs a daily update at 04:10, but you can run it manually if it hasn't already picked up a rule set. Martin
Upgraded to new spamassassin (3.3.2), and now it won't work (no rules.....ran sa-update, nada...
I wanted to try to head off an increasing spam count I'd gotten since I upgraded my suse server to 11.4 ... So I tried cpan to go to 3.3.2, but now... it says... no rules!... I've tried putting rules in just about every dir I can think of... I had it running as a daemon before - I thought it ran as spamd, but I could be wrong -- just that there have been user and group spamd on my system since I first installed SA a few years ago, but I'd installed some suse versions since then and they still worked, and I don't know how they ran. The latest incarnation didn't work as well, so I went back to try the cpan version... *ouch*... where is it looking for its rules, in /dev/null?!? I tried running it as root, as spamd, giving it a home dir, in /usr/share/SA, /var/lib/SA, /etc/mail/SA (not to mention ~/.SA...)... "spamassassin --lint" comes back 'silently'... no errors... so if there are no errors, why does it claim it can't find any rules? Ya'd think SA --lint would notice that as being a problem...? :-( ideas?
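P.S. One debugging trick that should show which rule dirs actually get read (the grep pattern is a guess at the relevant debug lines):

    spamassassin -D config --lint 2>&1 | grep 'config: using'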
Re: HT-perf, paralism, thruput+latncy (dsk, net, RBLs) powr usg/meas, perlMiltring & ISP's reducng spamd latency
Nix wrote: [This is really OT for spamassassin, isn't it? Should we take it off-list?] -- a bit -- and somewhat not. Much of it boils down to speed. How to best do it, parallelism, new hardware features... lowering latency... etc. I'd really hoped to speed up my SA processing -- at least it can handle a sizable concurrent load now, that's an improvement. Need to figure out a way to cache or speed up the network requests -- I'm sure it's mostly latency on the servers I'm checking with. The highest my download speed went was about 500K (on a 768K DSL)... it's all packet latency that's the prob. On 8 Aug 2009, Linda Walsh spake thusly: OK, you've out-RAIDed me. It's a server. Mostly unraided... sorta... 4 of them are in 2 VDs in mirror mode. The system disk is a 15K SAS, but only 70G space. The rest are what RAIDs are supposed to be -- Redundant Arrays of _Inexpensive_ disks (SATA). Boy was Dell pissed. They really don't like selling bare-bones systems. I had to buy the disk trays elsewhere (Dell won't sell them separately from a disk). Only 1 VD is a real RAID(5) with a whopping 3 disks... ooooo.. 2 disks are sitting around as spares until I can figure out how to add them to existing arrays (supposed to be 'easy' -- the controller rebuilds -- but nooo... I'm just spending too much time on the computer solving mail-filter problems while forcing myself up to speed w/perl 5.10's new features, CSS, and fonts again (I just hosed my desktop's fonts, so need to reboot... oops). I'd also prefer my own *choice* of whether or not to use the on-disk cache ... Maybe some of this control will get into the lk -- or does the bios have to support everything? Well, you'll never get the option to turn off the Linux kernel's disk cache, --- On-disk cache = the 16-32MB of cache on the disk. It really speeds up writes when you are writing small chunks, as it can coalesce the writes to physical positions on disk -- while the kernel only uses a generic 'model' of all disks. The real internal geometry is completely hidden these days -- you can see it talked about on Tom's HW occasionally when they bench a disk. You see fast constant speeds at track 0 (outside of the disk), then you see multiple 'drops' as the sectors/track shrink due to the smaller diameter. But the on-disk cache -- all the kernel developers dis it because they run unstable kernels that can leave up to 32MB in a write buffer on a disk if it gets reset or loses power before it finishes flushing its cache. But on a system on a UPS, not running test-kernels all the time, unplanned shutdowns are rare, so the speed-up is worth it. Just like the RAID controller itself has its own battery-backed (non-extensible) RAM (it doesn't know about UPSs and such) -- my previous server lasted 9 years... I feel in large part due to it being on a power-conditioned UPS (an APC SmartUPS that supposedly puts out a sine wave, despite my flaky PG&E power). fast speeds -- then because executables and shared libraries run out of it, --- I'm more worried about large write speeds. There, circumventing the system cache and using direct I/O can get you faster throughput when doing disk-to-disk copies -- the limiting factor is the target disk write rate, and no kernel cache will help. What does help is overlapping reads from one device and doing writes to the other device that fit in its buffer. Then you can theoretically get _closer to_, but not quite, double the throughput (as writes are slower).
But if you write in, say, 24MB chunks to a 32MB on-disk (no RAID) buffer, it can often get the data out while you are reading the next 24MB from the 1st disk. If you go through the kernel's system cache, it throws everything off -- you can watch it -- the kernel will give priority to reads (as it should, as reads usually block the CPU or the USER from getting things done), while writes can *usually* be done lazily in the background. But on D:D copies of large multi-gig files, you want write and read to be exactly balanced for optimal throughput. But that's a *special* case, when you are moving large data files around (for example, a 157G full backup -- and that's gzipped, because bzip2 uses too much CPU (lzma is much worse, though way better at compression)). On my old server (which died after 9 years; started with a p...@400mhz, ended with dual P-IIIs at 1GHz, but 256K cache each... hardly better than a Celeron!), bzip2 would slow down backup writing to disk to about 600K/s! gzip only cut speed by about half (from 20MB/s for raw data to 10MB/s). Compressed backups are nice, BUT, when you need to access them -- if you need to unpack a 100+G level zero... ouch... just to uncompress it would take hours! So ... while my new server is relatively fast -- I sorta earned it --
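P.S. The cache-bypassing copy above, concretely, with GNU dd (device names are placeholders -- don't aim this at disks you care about without double-checking):

    dd if=/dev/sdX of=/dev/sdY bs=24M iflag=direct oflag=direct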
Re: OT: Nehalem's New HT ability....
Per Jessen wrote: But how about the core subject here - the hyperthreading? Have you noticed anything very different wrt that? I haven't, but it will certainly depend on your workload. Definitely will depend on workload. But I noticed more power consumption, and it seemed to handle more real work with the new HTs, but I haven't had the sys long enough to do a lot of benching. Major noticeable diff in sys load and fans kicking in when running 4 CPU-intensive processes vs. 8 (used multiple copies of ssh-keygen -b 16384 to keep it busy -- 8192 finished too quickly... ~10-15 secs). Sigh
Re: OT: Nehalem's New HT ability.... and ability to handle spamd high load (preheating cache?)
My bios doesn't allow shutting off HT, but does allow turning off 2 or 3 cores (allowing dual or single) -- I'd rather see that type of feature at runtime, allowing system load to decide whether to activate another core -- though the diff on my 2.6GHz in power consumption went from about 157 watts (according to its front panel) to over 260 when I loaded all 8 'virtual' cores (only 4 cores x 2 HTs/core). That's w/8 hard disks inside (though not under load... just spinning). Seems to be no way on my machine (Dell is so limiting sometimes) to turn off unused hard drives, or only spin them up when I want to use them -- some are hot-spare or just unconfig'ed, yet they spin up. I'd also prefer my own *choice* of whether or not to use the on-disk cache as well as the raid controller's cache. I virtually never have unplanned shutdowns -- (it's on a UPS that will run for >1 hour under its load). Maybe some of this control will get into the lk -- or does the bios have to support everything? Supposedly it has temp and electrical monitoring 'galore', but I can't even read the DIMM temps. I went with the 'eco' power supplies at 570W (vs. 870), but got the dual power-supply backup -- I think, from what I can measure, it splits the power usage between the supplies unless one goes out. That could mean I really have 1140W available? Dunno. Not sure exactly what 'spare' means -- if it limits total consumption to the level of 1 supply even though it splits the load (I hooked a power meter to one and watched it go to half load when the other was plugged in). BTW, I'm running at 1333MHz, so maybe it's a heat dissipation prob and not power? I'm only pulling 157-160 to a max of 260 (didn't have disks churning though -- was just running copies of ssh-keygen -b 16384 -- that seems to take it a little bit... 8192 comes out in about 10 seconds though. :-). Oblig:sa-users -- I may finally have my 'dead-email' restart problem solved. Before, if I had a large queue, I had to stop fetchmail often -- download only 10-20 at a time so its emails wouldn't overload my sendmail queue (it gets backed up on spamassassin). My minimum time for SA (w/network tests) is around 3 seconds. But during heavy loads it can really go high -- and my machine can just run out of memory and process space (part of it is sendmail looking up hosts of received email and bind starting 'cold' (no cache)). But I started with 2700 emails, ... after the # of processes got to about 900, I chickened out a bit and paused the fetchmail until they dropped under 400 (note, 'load' never went over '2' the whole time, so it was mostly network wait time). But after the initial clear I had about 2200 emails left and just let it run. At that point, I could see it keeping up -- bind's cache was a lot warmer now, so not as much network traffic. I added the 'delay time' taken by spamd when running my email inputs (it's actually my filter delay time, but the max diff between the two is about .01 seconds, so it's mostly spamd delay) -- my stats for today from ~9:30am are: (n=#emails) n=4513, min=3.27s, max=208.09s, ave=35.16s, mean=27.43s I suppose for RBLs, some of those results are cached in bind as well? I wonder if there's any way to speed up priming the cache before downloading a bunch of emails (not that I'm offline for that long usually) -- but it's sorta too bad bind doesn't save its DB on disk on a shutdown, and read it back in after a reboot -- and then expire entries as needed...
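P.S. On priming the cache: something like this could warm bind up before a big fetchmail run (a sketch -- the saved-IP file and the RBL zone are placeholders):

    # fire off reverse-octet RBL lookups for recently seen sender IPs
    while read ip; do
        rev=$(echo $ip | awk -F. '{print $4"."$3"."$2"."$1}')
        dig +short $rev.zen.spamhaus.org @127.0.0.1 >/dev/null &
    done < recent-sender-ips.txt
    wait

Not as good as bind saving its DB across restarts, but it fakes the same effect for the zones that matter.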
Nix wrote: On 1 Aug 2009, Linda Walsh stated: Per Jessen wrote: Not sure about that - AFAICT, it's exactly the same technology. (I haven't done any exhaustive tests though.) Supposedly 'Very' different (I hope)... Oh yes. I have a P4 here (2GHz Northwood), and two Nehalems (one 2.6GHz Core i7 with 12Gb RAM and a 2.26GHz L5520 with 24Gb, hello overkill). Compared to the P4s, the Nehalems are *searingly* fast: the performance difference is far higher than I was expecting, and much higher than the clockspeed difference would imply. Things the P4 takes half an hour to do, the Nehalems often slam through in a minute or less (!), especially things like compilations that need a lot of cache. Surprisingly, even some non-parallelizable things (like going into a big newsgroup in Gnus) are hugely faster (22 minutes versus 39 seconds: it's a *really* big newsgroup). I suspect the cause is almost entirely the memory interface and cache. The Northwood has, what, 512Kb L2 cache? The Nehalem has 256Kb... but it has 8Mb of shared L3 cache, and an enormously faster memory interface (the FSB is dead, Intel has a decent competitor to HyperTransport at last). I was an AMD fan for years, but the Nehalem has won me back to Intel again. 1) You ca
Re: OT: Nehalem's New HT ability....
Per Jessen wrote: Not sure about that - AFAICT, it's exactly the same technology. (I haven't done any exhaustive tests though.) Supposedly 'Very' different (I hope)... 1) You can't turn it off in the BIOS. 2) Claim of benefit from increased cache (FALSE). (I have an older 2x2 dual-core machine with 4MB L2 cache/dual core; if you only use 1 core/CPU, that's 4MB L2 cache/core.) New machine with 1 quad core (dual-core CPUs are too slow to use memory faster than 800MHz -- only quad cores go up to QuickPath speeds that will support the fastest memory of 1333MHz, even if you only have 1 CPU). So you are 'encouraged' to go with quad over 2x2 dual. Quad has 8MB L3 cache, w/256K dedicated L2/core. So with HT, 128K/thread. To get 2 cores, they'll get 256K L2 ea, + 8MB L3 shared. So about 3.125% more memory ea! WOW!... (though the bandwidth from the fast core processors to main memory can be 2x faster). 3) Here's a possible benefit: they've added more parallel resources to each core -- so each thread can possibly get more done than the old threads -- but this is only a maybe, depending on workload. The biggest cool thing about Nehalem is power savings -- they implemented Celeron's power-step tech in a big way. Quiescent cores crank down their clocks independently to about 60% of top speed and have efficient sleep states (I think some cores can be halted, but not sure). Some of their processors have a 'turbo mode', which will run some small amount faster than the speed on the chip label (does that mean the turbo chips are really faster-rated chips... you tell me), BUT if fewer cores are used -- say only 2/4 -- the turbo boost can be a small amount greater (I don't have access to exact figures; don't know if any are published). If one was to go from their marketing graphs (HAHAHAHAHA), turbo for 4 cores is about 10% more, and if only 2/4 cores are running, it's an additional 10%. So marketing hype/reality might mean 1-3% faster? I will say this much -- @ idle, w/8 disks (it's a server, so built-in GPU with 8MB shared memory, if you aren't going headless) -- with dual/redundant PS, it uses 157W (1 PS, slightly more efficient at 146W). Major power savings with possible big increases in speed. But you can't turn off HT as in previous machines (at least not in the one I've had access to). That power consumption is less than half their older workstation model (though an idle graphics card still sucks quite a bit of useless ergs (stupid Nvidia)).. Oblig SA content: When I ran 100 msgs through my filters (that connect to spamd, but that uses the net), the MHz immediately jumped from ~1596 up to 2300 on each of the '8' HT cores... so it might be perfect for a server that gets sporadic loads! ;-) -linda
Re: Parallelizing Spam Assassin
Well -- it's not just the cores -- what was the usage of the cores that were being used? Were 3 out of the 8 'pegged'? Are these 'real' cores, or HT cores? In the Core2 and P4 archs, HTs actually slowed down a good many workloads unless they were tightly constructed to work on the same data in cache. Else, those HTs did just enough extra work to thrash cache contents more than anything else. What's the disk I/O look like? I mean, don't just focus on idle cores -- if the wait is on disk, maybe the cores can't get the data fast enough. If the network is involved, well, that's a drag on any message checking. I'm seeing times of .3 msgs/sec, but I think that's with networking turned on. Pretty ugly. poifgh wrote: Henrik K wrote: Yeah, given that my 4x3GHz box masscheck peaks at 22 msgs/sec, without Net/AWL/Bayes. But that's the 3.3 SVN ruleset.. wonder what version was used and any nondefault rules/settings? Certainly sounds strange that 1 core could top out the same. Anyone else have figures? Maybe I've borked something myself.. The problem is not with 22 being a low number, but when we have other free cores to run different SA instances in parallel, why doesn't the throughput scale linearly .. I expect for 8 cores with 8 SA running simultaneously the number to be 150+ msgs/sec, but it is 1/3rd of that at 50 msgs/sec
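One way to put numbers on poifgh's scaling question (a sketch; assumes a corpus/ directory of individual .msg files and a running spamd):

    for p in 1 2 4 8; do
        echo "concurrency $p:"
        time find corpus -name '*.msg' -print0 |
            xargs -0 -P $p -n 1 sh -c 'spamc -c < "$1" >/dev/null' sh
    done

If msgs/sec stops improving between -P 4 and -P 8, the bottleneck is somewhere other than the cores (disk, network tests, lock contention).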
Re: Parallelizing Spam Assassin
May I point out that while you may find the language crude -- it isn't language that would violate FCC standards, in that it didn't use any of the 7 or so 'unmentionable words'... People -- these standards of 'crude language' really need to be strongly held 'in check' -- the US is 'supposed' to be the society of 'free speech' unless it is obscene or threatening. I don't think his posting was either (BTW, I've never even 'heard' or seen his name before this post; all I saw was his 'uk' addr -- and I've known a few 'uk' types, and many of them sound very crude to an American ear these days). So in addition to applying strictures in a conservative manner, we must, hopefully, try to be sensitive to different cultural backgrounds. If I was talking with a black teen from downtown SF/Oakland, I'd have to translate from Ebonics -- which can sound rather crude and might contain an F-word every other sentence. I just apply my linguistic filter and attempt to get the meaning. I hardly think this list is aimed at a young audience -- and any kid 13+ is going to have heard quite an earful of 'colorful expletives', from ST4: The Voyage Home (a family movie) to everyday peer talk. Yes -- it sounded crude... more than I normally hear in America -- but not more than I'd hear in London. Just my 2 cents on cultural sensitivity, and the ability to be amused at cultural differences (rather than choosing to be offended by them). p.s. - Most commercial vendor products are Bantha Poodoo -- especially for virus/security and spam protection, but NOT all. Usually the ones with the highest advertising profile are the worst -- they put more budget into advertising than engineering. Yeah, I still think SA is a bit slow, but I put much of that up to it being written in an interpreted language and its wide flexibility and extensibility with plug-ins. Whatcha gonna do? Maybe we should rewrite it in Forth? *grin*...
Re: Parallelizing Spam Assassin
It's an American thing. Things that are normal speech for UK blokes get Americans all disturbed. Funny, it used to be the other way around... but well... times change. Justin Mason wrote: On Fri, Jul 31, 2009 at 09:32, rich...@buzzhost.co.uk wrote: Imagine what Barracuda Networks could do with that if they did not fill their gay little boxes with hardware rubbish from the floors of MSI and supermicro. Jesus, try and process that many messages with a $30,000 Barracuda and watch support bitch 'You are fully scanning too much mail and making our rubbish hardware wet the bed.' LOL. Richard -- please watch your language. This is a public mailing list, and offensive language here is inappropriate.
Re: AWL functionality messed up?
Jeff Mincy wrote: From: Linda Walsh Date: Wed, 27 May 2009 12:48:43 -0700 Bowie Bailey wrote: > At face value, this seems very counterproductive. You still aren't understanding the wiki or the AWL scoring or what AWL is trying to do. Ah, but it only seems I'm daft, today... :-) If I get spam from 1000 senders, they all end up in my AWL??? Yes. Every email+IP address pair that sends you email winds up in your AWL with an average score for that pair. This is ok. GRRR... not so ok in my mindset, but ... and ... errr.. well, that only makes it more confusing, in a way... since I was only 99% certain that I'd never gotten any HAM from hostname '518501.com' (thinking for a short period that AWL might classify things by host as reliable or not, instead of, or in addition to, by email-addr), but I'm 99.97% certain I've never gotten any HAM from user 'paypal.notify' (at) hostname '5185 AWL should only be added to by emails judged to be 'ham' via the feedback mechanisms -- spammers shouldn't get bonuses for being repeat senders... You are getting too attached to the 'whitelist' part of the name. Pretend AWL stands for average weighting list. = Aw... come on. Isn't the world difficult enough without changing white to black or white to weighing? I mean, we humans have enough trouble agreeing on what our symbols, "words", mean in relation to concepts and all, without ya goin' and redefining perfectly good acceptable symbols to mean something else completely and still claiming it to be some semblance of English. No wonder most of the non-techno-literate humans in this world regard us techies with a hint of suspicion regarding the difficulty of problems. We go around redefining words to suit reality and catch the heat when the rest of the world doesn't understand our meaning: Pointy-Haired Boss: "Well, how long did you say it would take?" Geek: "Well, I said it was 3-4 weeks worth of work." PHB: "Then why has it been 6 weeks with no product? I told you anything over 4 weeks was unacceptable!" G: "6 weeks, but... to get under 4 weeks, I assumed you were talking 168-hour pure-programming-time weeks -- not CALENDAR weeks!" AWL isn't whitelisting spammers. It is pushing the score to the average for that sender. The sender can have a high average or a low average. --- An average? So it keeps the scores of all the past emails from every sender that ever sent to us? It must just store a weighted average -- otherwise the space (hmm... someone said something about 80MB+ auto-whitelist DB files?). Why not call it the Historically Based Score Normalizer, or HBSN module? The db file could be "historical-norms" or something. If the previous email from a particular sender was FP or FN then AWL will have an incorrect average and will wind up doing or trying to do the wrong thing with subsequent email for that sender. Maybe it shouldn't add in the 'average' unless it exceeds the 'auto-learning threshold'?? I.e. something like the 'bayes_auto_learn_threshold_nonspam' for HAM and the 'bayes_auto_learn_threshold_spam' for SPAM. Assuming it doesn't already do such a thing, it would make a little sense... so as not to train it on 'bad data'... When I run "sa-learn --spam " over a message, can I assume (or is it the case) that telling SA a message was 'spam' would assign a sufficiently large value to the 'HBSN' entry for that sender to reduce the effect of any falsely incorrect value (if that is likely to happen)? Or might I at least assume that each "sa-learn" over a message will modify its AWL score appropriately?
You can remove addresses using spamassassin --remove-from-whitelist Yes... saw that after visiting the wiki. Is there a --show-whitelist-with-current-scores-and-their-weight switch as well (as opposed to one that only shows the addrs in the whitelist, or only shows the non-weighted scores)? Thanks... and um... How difficult would it be to have the name of the module reflect what it's actually doing? Maybe roll out a name change with the next ".dot" release of SA? (3.3? 3.4?) Might alleviate some amount of confusion(?)... Does the AWL also keep track of when it last saw an 'email' addr so it can 'expire' the oldest entries, so the db doesn't grow to eventually consume all forms of matter and energy in the universe? :-) Thanks for the clarification and info!! -linda
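P.S. For others puzzling over this thread: as I understand it (numbers made up; "factor" is the auto_whitelist_factor setting, default 0.5), the averaging works roughly like:

    final = score + factor * (mean - score)

    e.g. historical mean = 2.0, this message's pre-AWL score = 10.0:
         adjustment = 0.5 * (2.0 - 10.0) = -4.0  ->  final score 6.0

So a repeat spammer's mean stays high and the "bonus" stays small; the scary case is the one Jeff describes, where an early FP/FN skews the mean.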
Re: my AWL messed up?
Bowie Bailey wrote: Linda Walsh wrote: I got a really poorly scored piece of spam -- one thing that stood out as weird was the report claimed the sender was in my AWL. Any sender who has sent mail to you previously will be in your AWL. This is probably the most misunderstood component of SA. Read the wiki. http://wiki.apache.org/spamassassin/AutoWhitelist --- To be clear about what is being whitelisted, would it hurt if the 'brief report' for the AWL, instead of: -1.3 AWL AWL: From: address is in the auto white-list had: -1.3 AWL AWL: 'From: 518501.com' addr is in auto white-list So I can see what domain it is flagging with a 'white' value? I don't know of any emails from '518501.com' that wouldn't have been classified spam, so none should have a 'negative value'.
AWL functionality messed up?
Bowie Bailey wrote: Linda Walsh wrote: I got a really poorly scored piece of spam -- one thing that stood out as weird was the report claimed the sender was in my AWL. Any sender who has sent mail to you previously will be in your AWL. This is probably the most misunderstood component of SA. Read the wiki. http://wiki.apache.org/spamassassin/AutoWhitelist At face value, this seems very counterproductive. If I get spam from 1000 senders, they all end up in my AWL??? WTF? AWL should only be added to by emails judged to be 'ham' via the feedback mechanisms -- spammers shouldn't get bonuses for being repeat senders... How do I delete spammer addresses from my 'auto-white-list'? (That's just insane.. whitelisting spammers?!?!)
Re: new netset warn msg (howto avoid?)
Jari Fredriksson wrote: I see this message coming out of my SA a lot these days since upgrading to 3.2.5: [23920] warn: netset: cannot include 127.0.0.0/8 as it has already been included Where is this local net being 'included', and how can I suppress the duplicate inclusion message? Thanks, linda It is in /etc/spamassassin/local.cf as the "internal networks" or "trusted networks" setting. Remove it and no message should be shown. Ah... In earlier SA versions, I used to have 'internal_networks 127.' on a line by itself. Is the "127" network now 'built-in' in the current SA? Thanks in Advance! linda
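P.S. The warning text itself ("has already been included") suggests the loopback net is now added automatically, so an explicit line like either of these in local.cf or user_prefs is what produces the duplicate:

    trusted_networks 127/8
    internal_networks 127.

Removing the explicit 127 entry silences the warning.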
new netset warn msg (howto avoid?)
I see this message coming out of my SA a lot these days since upgrading to 3.2.5: [23920] warn: netset: cannot include 127.0.0.0/8 as it has already been included Where is this local net being 'included', and how can I suppress the duplicate inclusion message? Thanks, linda
Re: user-db size, excess growth...limits ignored
LuKreme wrote: On 1-Apr-2009, at 13:27, Linda Walsh wrote: *ouch* -- you mean each message writes out an 80MB white-list file? That's a lot of I/O per message, no wonder spamd seems to be slowing down... No, these are DB files. Data is added to them; this does not necessitate rewriting the entire file. --- Yeah -- then this refers back to the bug about there being no way to prune that file -- it just slowly grows and needs to be read in when spamd starts(?), and spamd needs to keep that info around as the basis for its AWL scoring, no? So the only real harm is the increased read-initialization and the run-time AWL length?
Re: user-db size, excess growth...limits ignored
Matt Kettler wrote: Linda Walsh wrote: Matt Kettler wrote: I see 3 DBs in my user directory (.spamassassin): auto-whitelist (~80MB), bayes_seen (~40MB), bayes_toks (~20MB). Expiry will only affect bayes_toks. Currently neither auto-whitelist nor bayes_seen have any expiry mechanism at all. --- So they just grow without limit? Yep. Not ideal, and there's bugs open on both. How often does the whitelist get sync'd to disk? In the case of the whitelist, it's per-message. --- *ouch* -- you mean each message writes out an 80MB white-list file? That's a lot of I/O per message, no wonder spamd seems to be slowing down... Having changed the user_prefs files back to the default setting (i.e. deleted my previous addition) 2 days ago, and the system was rebooted 1 day 14 hours ago, I'm certain spamd has been restarted. Hmm, can you set bayes_expiry_max_db_size in a user_prefs file? That seems like an option that might be privileged and only honored at the site-wide level. An absurdly large value can bog the whole server down when processing mail, so an end user could DoS your machine if allowed to set this. I *thought* I could set it -- certainly, the only place I *increased* the tokens beyond the *default* was in user_prefs. That *seems* to have worked in bumping up the toks to 500K, but now lowering it is being ignored. Perhaps the user_prefs option to set #tokens changed: an old version allowed it and raised it to 500K, but the newer version disallows it, so I can't 're-lower' it (though I'd think the global 150K limit would have been re-applied). That said, 3.1.7 is vulnerable to CVE-2007-0451 and CVE-2007-2873. You should seriously consider upgrading for the first one. --- While I was supporting multiple local users at one point, I'm the only local user now, so local-user escalation to create local service denial isn't the top-most concern. Doesn't mean I shouldn't upgrade for other reasons. I'm still *greatly* concerned about an 80MB file being written to disk potentially on every incoming email message. That seems a high overhead -- or are there mitigating factors that decrease that amount in 99% of the cases? Tnx, Linda
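P.S. For reference, the knob in question, set site-wide where it's guaranteed to be honored (the value shown is the documented default, counted in tokens, not bytes):

    # /etc/mail/spamassassin/local.cf
    bayes_expiry_max_db_size 150000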
One BUG found: userpref whitelist pattern BUG/DOC prob;
Bowie Bailey wrote: Linda Walsh wrote: I get many emails addressed to internal sendmail message-IDs: 123...@mydomain, 1abd56.ef7...@mydomain (they seem to fit a basic pattern, but I don't know how to specify the pattern (or I don't have it right): <(start of an email-address)>[0-9][0-9a-f]*\@mydomain I think this is what you are looking for (untested): header MY_NUMBER_EMAIL To:addr =~ /^\d[0-9a-f.]*\@mydomain/i Look in the "Rule Definition" section of the man page for Mail::SpamAssassin::Conf for more info on the ':addr' option. -- I found, BURIED, in the doc "Mail::SpamAssassin::Conf", the broken, primitive rules for what white/black list patterns are allowed: Whitelist and blacklist addresses are now file-glob-style patterns, so "fri...@somewhere.com", "*...@isp.com", or "*.domain.net" will all work. Specifically, "*" and "?" are allowed, but all other metacharacters are not. Regular expressions are not used for security reasons. === These are NOT file-glob-style patterns as on Linux. These are examples of non-regex file-glob patterns that don't work under SA: "[0-9][0-9a-f]*.domain", "[0-9]*.domain", "[^0-9]*.domain". They don't work: the "bracket notation for a single character" isn't supported. 1) Instead you need: 0*.domain 1*.domain 2*.domain 3*.domain 4*.domain 5*.domain 6*.domain 7*.domain 8*.domain 9*.domain 2) There is no way to express negation. 3) The documentation is ALSO unclear on whether the expression is a full or partial match, as "^" and "$" are also not included. So it's unclear if "@domain" is the same as "*...@domain". Attempts to match the form "^[0-9][0-9a-f].*.domain$" (ex: "0...@domain") fail to match, as does any more complex file-glob. White/black lists should not claim 'file-glob' matching ability if they don't even include single-char 'range' matches. This was the question THAT NO ONE understood or could answer. If the format of white/black list entries in 'userprefs' is SO arcane, limited, and poorly documented, I assert it is a bug. Short-term, documentation would be the quickest fix (get rid of the file-glob description, as it's not true in the normal sense of file-glob); longer term, the fix might be real file-globs AND making clear whether the pattern provided must match the full email address, or if a partial match will be considered a positive match (i.e. whether "@foobar" is the same as "*...@foobar*"). Sorry if I am coming across a bit terse, but this hard-to-find and misleading description has been a long-term "bug" in my filtering rules. Seems like a lot of email-harvesting progs see mail-ID numbers like "12345.6ab...@domain" as email addrs, which in my setup are completely bogus. -linda
Re: user-db size, content confusions (how many toks?)
Matt Kettler wrote: I see 3 DBs in my user directory (.spamassassin): auto-whitelist (~80MB), bayes_seen (~40MB), bayes_toks (~20MB). Was trying to find the relation of 'bayes_expiry_max_db_size' to the physical size of the above files. --- Expiry will only affect bayes_toks. Currently neither auto-whitelist nor bayes_seen have any expiry mechanism at all. --- So they just grow without limit? How often are they loaded? Does only "spamd" access the auto-whitelist? Optimally, I would assume spamd opens it upon start, but it needs to update the disk file periodically (sync the db) for reliability. How often does it 'sync'? bayes_seen can safely be deleted if you need to. It keeps track of what messages have already been learned, to prevent relearning them. However, unless you're likely to re-feed messages to SA, bayes_seen isn't strictly necessary. --- The only refeeding would usually be 'ham', because I might rerun over an "Inbox" that might have old messages in it. I don't rerun "ham" training often -- except to "despam" a message (one that was marked spam and shouldn't have been). I'm finding some answers, I've run into some seeming "contradictions". ... --- First prob (contradiction): dbg above says "token count: 0". (This is with a combined bayes db size of 60MB (_seen, _toks).) Are you sure your sa-learn was using the same DB path? --- Sure?? It listed the same filename (default location /home//.spamassassin/). Other than that, I haven't tried to trace perl running spamassassin to see if it is really accessing the same file. Only going off the 'debug' messages (which correspond to the settings in "user_prefs" that's in the default location dir). From the sounds of it, sa-learn is using a directory with an empty DB. Yeah... Doesn't make sense to me -- how would "sa-learn --dump magic" use a different location? I.e. it showed ~500K tokens... I.e. doesn't 'ntokens' = 491743 mean slightly under 500K tokens? Yep, looks like you have 491,743 tokens to me. It's like the sa-learn magic shows a 'db' corresponding to my old limit (which I think is still being 'auto-expired', so it might not show the pruned figure, as that runs about once per 24 hours, if I understand normal spamd workings). Approximately. Also, be aware that in order for spamd to use new settings it needs to be restarted. Having changed the user_prefs files back to the default setting (i.e. deleted my previous addition) 2 days ago, and the system was rebooted 1 day 14 hours ago, I'm certain spamd has been restarted. YET: all db sizes are the same as before (no reduction in size corresponding to going 'back' to a default 150K limit), though sa-learn run with dbg and force-expire indicated 0 tokens -- but sa-learn w/dump magic indicates 500K tokens. How can "expire" say 0 toks but dump-magic say 500K? File timestamps show all 3 db files have been updated today (presumably by spamd processing email as it comes in). But file sizes are still at the sizes indicated at the top of this message: 80/40/20 MB. So is the --magic output maybe what is seen and being 'size-controlled' by auto-expire? Yes; at least, it should be. Why isn't 'sa-learn --force-expire' seeing the TOKENs indicated in sa-learn --dump magic? That is particularly strange to me, and it sounds like there's some problems there. --- *sigh* Can you give a bit of detail, ie: what paths are you looking at for the files, what version of SA, --- SA = old version 3.1.7. Which at the very least points to an upgrade possibly solving the problem, BUT this was working at one point, and I don't know why it 'stopped'.
I'm generally uncomfortable with fixing things that were working, just because they have randomly stopped working, without knowing *why* (though that discomfort has become something I've just had to deal with more as the Microsoft SW maintenance method becomes the norm: update and see if the bug is gone... yes? ok, bug gone (unclear if fixed or hidden, unclear about effects of other changes in a new version...)). Am I misinterpreting the debug output? No, you don't seem to be. --- Thanks for the confirmation of my 'reality'. Really, the most logical and time-efficient way to proceed is likely to upgrade to a newer version at some point soon (and ignore my discontent regarding 'not knowing' why or what caused the break). *sigh* Linda
user-db size, content confusions (how many toks?)
I see 3 DBs in my user directory (.spamassassin): auto-whitelist (~80MB) bayes_seen (~40MB) bayes_toks (~20MB) Was trying to find the relation of 'bayes_expiry_max_db_size' to the physical size of the above files. I'm finding some answers, but I've run into some seeming "contradictions". Had db_size set to 500,000, reduced to 250,000 and then to 'default' (150,000) during testing. In trying to lower 'db_size' and see how that affected physical sizes, I ran sa-learn --force-expire and saw these debug messages of 'Note': [30905] dbg: bayes: expiry check keep size, 0.75 * max: 112500 [30905] dbg: bayes: token count: 0, final goal reduction size: -112500 [30905] dbg: bayes: reduction goal of -112500 is under 1,000 tokens, skipping expire [30905] dbg: bayes: expiry completed --- First prob (contradiction): dbg above says "token count: 0". (This is with a combined bayes db size of 60MB (_seen, _toks).) Seems to think I have no bayes data. Saw another dbg msg that indicated the bayes classifier was untrained (<~150? entries) & disabled. Dunno how it got zeroed, but tried adding 'ham' by running sa-learn over a despam'ed mailbox. First run showed: Learned tokens from 55 message(s) (55 message(s) examined) But subsequent runs of 'sa-learn with dbg+expire' still show token count: 0. sa-learn --dump magic shows something different: 0.000 0 3 0 non-token data: bayes db version 0.000 0 556414 0 non-token data: nspam 0.000 0 574441 0 non-token data: nham 0.000 0 491743 0 non-token data: ntokens 0.000 0 1216456288 0 non-token data: oldest atime 0.000 0 1237796146 0 non-token data: newest atime 0.000 0 1220476831 0 non-token data: last journal sync atime 0.000 0 1217838535 0 non-token data: last expiry atime 0.000 0 1382400 0 non-token data: last expire atime delta 0.000 0 70612 0 non-token data: last expire reduction count - Does the above indicate 0 tokens? I.e. doesn't 'ntokens' = 491743 mean slightly under 500K tokens (my original limit before trying to run 'sa-learn --force-expire + dbg' manually)? It's like the sa-learn magic shows a 'db' corresponding to my old limit (which I think is still being 'auto-expired', so it might not show the pruned figure, as that runs about once per 24 hours, if I understand normal spamd workings). So is the --magic output maybe what is seen and being 'size-controlled' by auto-expire (it was ~500K before recent test changes)? Why isn't 'sa-learn --force-expire' seeing the TOKENs indicated in sa-learn --dump magic? Debug messages are pointing at the same file for both operations, so how can dump-magic indicate 500K while the debug of sa-learn --force-expire somehow sees 0 TOKENs? Am I misinterpreting the debug output? Thanks, Linda
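P.S. One way to confirm both operations really look at the same files is to grep the debug stream for the tie messages, which print the full path:

    sa-learn -D --dump magic 2>&1 | grep 'tie-ing'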
Re: What is AWL: _Average-Whitelister_....
John Hardin wrote: What is the AWL rule? Why does it give such different amounts of points? "Auto Whitelist" is a misleading name. It is actually a score averager. Since the points it applies are based on the historical scoring from that sender, the score will vary by who the sender is and when the message is processed (i.e. their history to-date). --- Thank you for the clear and simple explanation. Perhaps AWL (AutomaticWhiteList) should be renamed to AWL (Averaging-Whitelister). While the acronym would/could stay the same, the standard expanded form should say it is an "Averaging" something (whitelister, blacklister, whatever). The important point is not that it's automatically applied, but that it's _A_veraging... This clarifies this long-outstanding Q for me as well. Thanks! Linda
Re: userpref whitelist pattern problem
LuKreme wrote:
On 13-Mar-2009, at 12:58, Linda Walsh wrote:
I get many emails addressed to internal sendmail 's. 123...@mydomain or 1abd56.ef7...@mydomain (they seem to fit a basic pattern, but I don't know how to specify the pattern, or I don't have it right): <(start of an email-address)>[0-9][0-9a-fA-F\@mydomain
Generally: ^ means 'start of line', $ means 'end of line', but whitelist_from uses globbing, and I don't think you can use those anchors there. Are the emails coming in without a tld (1...@example and not 1...@example.com)?
All with a tld, but in two forms I am trying to catch:
from: 11234.2a...@somedomain.tld    (or)
from: larry <11234.2a...@somedomain.tld>
any hints would be appreciated... running slightly older SA 3.1.7 on perl 5.8.8
Slightly? No, that's ancient (2.5 years!!).
---
Sometimes I understate, sometimes I overstate. Hard to tell by the "dot" number unless one has been paying attention to all of a project's products; I guess it is difficult to judge age by the version-dot number alone.
Seriously, if there is only one thing you keep updated on your mailserver, it needs to be SpamAssassin.
---
Good to know; I was doing better, but got out of sync when I couldn't get perl-libs to update cleanly during some perl version update. Between updating:
- Perl versions...
- Distro RPM versions,
- CPAN module versions,
- requirements of SW dependent on Perl modules (i.e. SpamAssassin)
...one thing leads to another... and before you know it (up to butt in alligators?)... :-)
Tnx, Linda
Re: whitelist pattern problem in userpref-whitelisting
Does the below apply to the ~/.spamassassin/user_prefs whitelisting (command, keyword or feature)? Sorry... it was the whitelisting in the userpref file whose "primitive pattern matching" I was talking about. At one point it was limited to DOS-like file-matching patterns, not the full perl-regexp set (for which the example you gave below would be an excellent use!)... I don't see 'header' as a usable line in "user_prefs".
thanks, -linda

Bowie Bailey wrote:
Linda Walsh wrote:
> I get many emails addressed to internal sendmail 's.
> 123...@mydomain, 1abd56.ef7...@mydomain
> (they seem to fit a basic pattern, but I don't know how to specify the
> pattern, or I don't have it right):
> <(start of an email-address)>[0-9][0-9a-fA-F\@mydomain
>
> By start of an email addr, I mean inside or outside literal '<>'.
> I try matching on '<' as a start char to look for anything starting
> with a number, but that fails if they don't use the "name <addr>"
> format, but just use "x...@yy". Don't know how to anchor at the
> beginning of anything that looks like an email address.

I think this is what you are looking for (untested):

  header MY_NUMBER_EMAIL To:addr =~ /^\d[0-9a-f.]*\@mydomain/i

Look in the "Rule Definition" section of the man page for Mail::SpamAssassin::Conf for more info on the ':addr' option.

> I know the pattern matcher in the userprefs file is primitive though
> -- like DOS level file matching, so I don't know how to write
> it in userprefs...

user_prefs uses the exact same pattern matching as the rest of SA (Perl regexps). It is anything but primitive. The caveat being that rule definitions are not allowed in user_prefs files unless you allow it by putting this in your local.cf:

  allow_user_rules 1

> any hints would be appreciated...
> running slightly older SA 3.1.7 on perl 5.8.8
>
> intending to update ... eventually but don't know that this would
> solve any pattern help

Shouldn't make any difference for this.
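Putting the pieces together, a minimal sketch of how the two files cooperate (rule name, score and domain are placeholders; the regex follows Bowie's suggestion):

  # in local.cf (site-wide), since rules in user_prefs are off by default:
  allow_user_rules 1

  # in ~/.spamassassin/user_prefs:
  header MY_NUMBER_EMAIL  To:addr =~ /^\d[0-9a-f.]*\@mydomain\.tld/i
  score  MY_NUMBER_EMAIL  -5.0

A negative score here acts as the whitelist: mail to those generated addresses gets 5 points of credit rather than a blanket accept.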
whitelist pattern problem
I get many emails addressed to internal sendmail 's.
123...@mydomain
1abd56.ef7...@mydomain
(They seem to fit a basic pattern, but I don't know how to specify the pattern, or I don't have it right):
<(start of an email-address)>[0-9][0-9a-fA-F\@mydomain
By start of an email addr, I mean inside or outside literal '<>'. I try matching on '<' as a start char to look for anything starting with a number, but that fails if they don't use the "name <addr>" format, but just use "x...@yy". I don't know how to anchor at the beginning of anything that looks like an email address. I know the pattern matcher in the userprefs file is primitive though -- like DOS-level file matching -- so I don't know how to write it in userprefs... any hints would be appreciated... running slightly older SA 3.1.7 on perl 5.8.8. Intending to update... eventually, but I don't know that updating would help with the pattern.
Thanks, -linda
RFE? Or is there an easy way to do this?
I have some email accounts that I use with particular vendors or lists, and a few email accounts known only to a single person or company. What I'd like is some way of white-listing a "to-addr" if it is from a list of "from-addrs"; else add something (a constant?) to its spam score. An even more advanced, non-trivial check would be: "if to-addr(X) and sender not in my contacts (addr-book), then SPAM, else ok". Anyone else have their own ways to do these checks? A sketch of one approach follows. thanks, -linda
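The first check can be expressed with stock header + meta rules (a sketch; the addresses, rule names and score are made-up placeholders):

  # an address I only gave to one vendor
  header __TO_VENDOR_ONLY  To:addr   =~ /^vendoronly\@mydomain\.tld$/i
  # the sender allowed to use it
  header __FROM_VENDOR     From:addr =~ /\@thevendor\.example$/i
  # penalize mail to that address from anyone else
  meta   VENDOR_ADDR_LEAKED (__TO_VENDOR_ONLY && !__FROM_VENDOR)
  score  VENDOR_ADDR_LEAKED 3.0

The "not in my contacts" variant would need one such pair per protected address, or a plugin that can consult an address book.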
Re: junkfiles-bayes_toks.expire\d{4,5}
A manual expire run took less than 2 minutes -- closer to 1 minute. How impatient is SA??
John Hardin wrote:
On Fri, 2008-07-25 at 18:35 -0700, Linda Walsh wrote:
Jul 25 15:28:21 Ishtar spamd[2355]: bayes: expire_old_tokens: child processing timeout at /usr/bin/spamd line 1085, line 22.
Your autoexpire is taking longer than SA is willing to wait. This is a fairly common question; there's lots of discussion in the list archives. Consensus: disable autoexpire and run a dedicated expiry from cron, weekly or daily based on your token volume.
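For the archives, the consensus setup sketched out (the option is real; the cron time is an arbitrary example):

  # local.cf: stop spamd children from expiring in-line
  bayes_auto_expire 0

  # user crontab: run the expiry off-line each night instead
  10 3 * * * sa-learn --force-expire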
Mail::SpamAssassin 3.2.5 fails: NOT OK
Can't install Mail::SpamAssassin via CPAN... it fails at the end. (Not the whole log, but enough to give context, I hope.) The same preprocessor command runs once per module, interleaved with "checking..." lines from a concurrent configure run:

/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" -DPREFIX="/usr" -DDEF_RULES_DIR="/usr/share/spamassassin" -DLOCAL_RULES_DIR="/etc/mail/spamassassin" -DLOCAL_STATE_DIR="/var/lib/spamassassin" >blib/lib/Mail/SpamAssassin/Locker/Flock.pm

...and likewise for:
blib/lib/Mail/SpamAssassin/BayesStore/SQL.pm
blib/lib/Mail/SpamAssassin/Logger/Syslog.pm
blib/lib/Mail/SpamAssassin/Plugin/SPF.pm
blib/lib/Mail/SpamAssassin/Plugin/Shortcircuit.pm
blib/lib/Mail/SpamAssassin/Client.pm
blib/lib/Mail/SpamAssassin/PerMsgStatus.pm
blib/lib/Mail/SpamAssassin/Plugin/URIDetail.pm
blib/lib/Mail/SpamAssassin/PerMsgLearner.pm
blib/lib/Mail/SpamAssassin/Plugin/WLBLEval.pm
blib/lib/Mail/SpamAssassin/Plugin/AutoLearnThreshold.pm
blib/lib/Mail/SpamAssassin/SQLBasedAddrList.pm
blib/lib/Mail/SpamAssassin/Plugin/TextCat.pm
blib/lib/Mail/SpamAssassin/PersistentAddrList.pm
blib/lib/Mail/SpamAssassin/Locker/UnixNFSSafe.pm
blib/lib/Mail/SpamAssassin/DnsResolver.pm
blib/lib/Mail/SpamAssassin/SubProcBackChannel.pm

The interleaved configure checks:
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h...
/usr/bin/perl build/preprocessor -Mconditional -Mvars -DVERSION="3.002005" -DPREFIX="/usr" -DDEF_RULES_DIR="/usr/
Re: junkfiles-bayes_toks.expire\d{4,5}
Matt Kettler wrote:
What version are you running? Reading around, the child processing timeout seems to have been a common problem in the 3.1.x series, but I've not seen it reported in the 3.2.x series.
---
Erp. I'll try upgrading and see what happens... I still have 3.1.7 installed.
Re: junkfiles-bayes_toks.expire\d{4,5}
Matt Kettler wrote:
The fact that they keep lying around is a problem. This suggests SA keeps getting killed before the expire can complete. Do you have any kind of limits set, such as CPU time or memory, that SA might be running up against and dying? You can try kicking off an expire manually using sa-learn --force-expire (add -D if you want some debug output). Note: this could run for a long time, particularly if bayes_toks is really large.
---
Another one of the files appeared -- 17M long, while bayes_toks is 8.8M. auto-whitelist is 78M -- that seems a bit excessive... I don't know what "really large" means -- bayes_toks isn't that large compared to some of the other files. No limits that I know of... Ahh... seeing some oddness in the log though (interrupted? timeouts? ... weird...):

Jul 25 15:23:59 Ishtar spamd[2447]: bayes: cannot open bayes databases /home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call
Jul 25 15:24:48 Ishtar spamd[2443]: bayes: cannot open bayes databases /home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call
Jul 25 15:28:21 Ishtar spamd[2355]: bayes: expire_old_tokens: child processing timeout at /usr/bin/spamd line 1085, line 22.
Jul 25 15:36:55 Ishtar spamd[2447]: bayes: cannot open bayes databases /home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call
Jul 25 15:41:38 Ishtar spamd[2443]: bayes: expire_old_tokens: child processing timeout at /usr/bin/spamd line 1085.
Jul 25 16:14:14 Ishtar spamd[2355]: bayes: cannot open bayes databases /home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call
Jul 25 16:19:00 Ishtar spamd[2385]: bayes: expire_old_tokens: child processing timeout at /usr/bin/spamd line 1085.
Jul 25 16:29:05 Ishtar spamd[2356]: bayes: expire_old_tokens: child processing timeout at /usr/bin/spamd line 1085.
Jul 25 17:06:02 Ishtar spamd[2385]: bayes: cannot open bayes databases /home/user/.spamassassin/bayes_* R/W: lock failed: Interrupted system call
junkfiles-bayes_toks.expire\d{4,5}
In my .spamassassin dir, I see lots of files that look like:

bayes_toks.expire1098   bayes_toks.expire1243   bayes_toks.expire13494  bayes_toks.expire15029
bayes_toks.expire15761  bayes_toks.expire16349  bayes_toks.expire17370  bayes_toks.expire17385
bayes_toks.expire1754   bayes_toks.expire18183  bayes_toks.expire18584  bayes_toks.expire18813
bayes_toks.expire19274  bayes_toks.expire19481  bayes_toks.expire20721  bayes_toks.expire2264
bayes_toks.expire2265   bayes_toks.expire2266   bayes_toks.expire2267   bayes_toks.expire22670
bayes_toks.expire2268   bayes_toks.expire2324   bayes_toks.expire2327   bayes_toks.expire2355
bayes_toks.expire2356   bayes_toks.expire2385   bayes_toks.expire23960  bayes_toks.expire2443
bayes_toks.expire2447   bayes_toks.expire25435  bayes_toks.expire26900  bayes_toks.expire29828
bayes_toks.expire31304  bayes_toks.expire3343   bayes_toks.expire3442   bayes_toks.expire3444
bayes_toks.expire4002   bayes_toks.expire4334   bayes_toks.expire4877   bayes_toks.expire5636
bayes_toks.expire5683   bayes_toks.expire5779   bayes_toks.expire6464   bayes_toks.expire9281
bayes_toks.expire9300

They are all a few hundred K or more long (I just deleted the bunch). I've also noticed spamd going off and cranking for more than an hour -- it seems to produce one of these files... Any idea what they are for? Or why SA would keep leaving them in my .sa (user) dir?
mem use of spamd processes: wasted memory? 'bug'?
I noticed something about my spamd processes. There is a "main" process at the top that spawns children. 5 of the 6 top memory users (by %) are 'spamd':

  5/6 top resident-memory users: 28M for the parent, 40m-49m per child (268M total + parent)
  5/7 top data users:            26M for the parent, 38-47m per child (259M total + parent)

So, since all of the spamds are accessing the same databases on disk, how come there is so much that isn't shared? Shouldn't the children be using mostly shared memory, with the only non-shared parts being the individual emails? Would or should this be characterized as a design bug?
-linda
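A quick way to eyeball the parent-vs-children numbers (assuming Linux procps; 'top' shows the shared column as SHR):

  # per-process resident and virtual sizes for all spamd processes
  ps -C spamd -o pid,ppid,rss,vsz,args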
Re: Discussion side point: levels of Trust
John Hardin wrote:
On Wed, 11 Jun 2008, SM wrote:
At 17:46 11-06-2008, Linda Walsh wrote:
How does one decide on 'trust'? I.e. I think it would be useful to assign a probability to "Trust" at the least. I mean, do I put my ISP in my trusted server list? -- suppose they start partnering with
It could be a reputation system where you assign a probability.
Probability of what, exactly? Bear in mind, "trusted" means "does not forge Received: headers", not "does not send or relay spam".
---
I am aware of this. However, it's not an easily discerned number; if I had ATT or Comcast as an ISP, my trust in them would maybe be a value of .7-.8. Like the ISP in Europe that inserted over 20 million ads into HTML pages -- they could just as easily be adjusting return headers. But more worrisome is the cooperation of ISPs with the unconstitutional 'lawless intercept' actions by law enforcement agencies, used to find and entrap end-users for any crime they wish to target. While the laws were sold on terrorist grounds, they were later bolstered via the mantra "for the children, for the children... it's all the child porn" (expanded to apply to anyone under age 18). I could easily see the possibility of domain information being corrupted -in real time- to allow interception of traffic -- which could either be used in a 'honeypot' scenario, or just to monitor. While in some cases the ISPs have no choice but to cooperate, there have been several high-profile ISPs (ATT, Verizon) who have handed over information without requiring any formal oversight or legal documents. That's scary as the US moves more toward the corrupted-GOP's idealized police state. Hopefully we can get some serious regime change to undo some of these worst practices... but governments are notoriously bad about letting go of power once they've grabbed hold of it.
Discussion side point: levels of Trust
Matthias Leisi wrote:
1) This advice:
| Tue Jun 10 14:55:36 2008 [72096] dbg: conf: trusted_networks are not
| configured; it is recommended that you configure trusted_networks manually
should not be ignored. Setting trusted_networks would slightly reduce the number of DNS lookups and can avoid all sorts of funny error situations.

How does one decide on 'trust'? I.e. I think it would be useful to assign a probability to "Trust" at the least. I mean, do I put my ISP in my trusted server list? -- suppose they start partnering with an ad-firm? Or get bought out? ... I probably won't know most of their internal politics... ISPs in some eastern states have already committed to filtering arbitrary sites based on local values and arbitrary listing policies(?) This whole 'save-the-child-porn' shtick the government is using as a necessary excuse to violate computer privacy is unacceptable. They did the same thing -- claimed they needed intrusive powers to protect against terrorists -- but 80% of the people they've used those powers against have been for 'common crimes' (or drug prosecutions). In the UK, they are using anti-terrorism surveillance cams to enforce doggie-doodoo pickup laws! In the US, the government is using "passenger manifests" of arriving overseas flights to detain and arrest foreign businessmen and citizens in civil and non-violent criminal investigations. But those are general complaints about the untrustworthiness of previously trustworthy entities. I don't have a binary trust value, really. As an example, going from most trusted to least, I might have:
- a lab/build/test machine (linux usually)
- internal server proxy to out-net (linux)
- windows XP desktop (it's windows, no direct outside connect, but can proxy)
- my ISP's servers
- root DNS servers (arguably more trustworthy than most ISPs, but since I have to go through my ISP to get to them, _logically_, how can I trust them more?)
- HTTPS personal-money sites... (for some things more trust than my ISP, but they are 'banks' -- so that trust comes with some grains of 'salt')
- mainstream web-providers (varies based on reputation, but examples would include Google, BBC(.co.uk); various online businesses with a physical presence 'seem' more trustworthy (at least you know where they are based?))
- government sites: depends; from 'ok' trust to downright untrustworthy
- unknown sites / known bad sites...
Re: Warning: "xxx" matches null string many times in regex in Text/Wrap.pm..
I looked at this error, and it appears to be caused by SpamAssassin passing an incorrect parameter to Text::Wrap -- changing the value of "$Text::Wrap::break" to something other than a "line breaking" character (or characters). In a file in my SpamAssassin-3.1.7 CPAN dir, there is a bad line:

./lib/Mail/SpamAssassin/PerMsgStatus.pm:996:
  $hdr = Mail::SpamAssassin::Util::wrap($hdr, "\t", "", 79, 0, '(?<=[\s,])');

The last argument, '(?<=[\s,])', appears to be invalid. The error message is:

  (?:(?<=[\s,]))* matches null string many times in regex; marked by <-- HERE
  in m/\G(?:(?<=[\s,]))* <-- HERE \Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 46.

In the Text::Wrap source code of 0704 (the last working version), there is (at line 46, remarkably enough :-)) the line:

  while ($t !~ /\G\s*\Z/gc) {

In versions 0711 and later, that line reads:

  while ($t !~ /\G(?:$break)*\Z/gc) {

Note that "\s" has been replaced by "(?:$break)". In the 0711 source code, $break defaults to '\s'. In other words, it appears from the code it replaces and from the default value of "$break" that "$break" should contain a pattern representing the characters to break on. However, in PerMsgStatus.pm:996 we see a *zero-length* pattern (the "(?<=pat)" part) passed in for the value of $break. Instead of matching the line-break character, it only matches the position and never matches the character itself -- thus it gets "stuck" applying the zero-length (null) pattern again and again (hence the message "matches null string many times"). I'm not sure what the author was trying to do in PerMsgStatus.pm, or who "owns" that line (or file), but perhaps they meant for "comma" to be included in the list of "break" characters. In that case, instead of '(?<=[\s,])' for the last argument of line 996, it should be '[\s,]'. That is, line 996 in lib/Mail/SpamAssassin/PerMsgStatus.pm should be:

  $hdr = Mail::SpamAssassin::Util::wrap($hdr, "\t", "", 79, 0, '[\s,]');

(instead of:

  $hdr = Mail::SpamAssassin::Util::wrap($hdr, "\t", "", 79, 0, '(?<=[\s,])');

)
I hope this was helpful?
Linda
---orig msg follows---
Theo Van Dinter wrote:
On Sun, Dec 24, 2006 at 05:43:12PM -0800, Linda Walsh wrote:
I've seen this error message in the past few upgrades (~3.11, .12, .17) and was wondering if anyone else has seen it and knows what the problem is.
Discussed so much it's an FAQ. :
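A minimal repro of the failure mode (a sketch; whether it merely warns or spins depends on your Text::Wrap version):

  use Text::Wrap qw(wrap);
  # a zero-width lookbehind matches a position, not a character, so
  # Text::Wrap's /\G(?:$break)*\Z/ can match the null string indefinitely
  $Text::Wrap::break = '(?<=[\s,])';
  print wrap("", "", "one, two, three four\n");
  # with a character-consuming pattern like '[\s,]', wrapping proceeds normally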
Re: Warning: "xxx" matches null string many times in regex in Text/Wrap.pm..
Many thanks. Didn't think to look in the FAQ... sigh. I have "local site configuration" esteem issues -- thinking it is usually something "peculiar" to my setup. So it's Text::Wrap... I'm surprised they haven't fixed it. It doesn't seem like it would be that difficult, as they should have a fairly large number of "test cases" and should know what they changed... (famous last words). -Linda
Theo Van Dinter wrote:
On Sun, Dec 24, 2006 at 05:43:12PM -0800, Linda Walsh wrote:
I've seen this error message in the past few upgrades (~3.11, .12, .17) and was wondering if anyone else has seen it and knows what the problem is.
Discussed so much it's an FAQ. :)
http://wiki.apache.org/spamassassin/TextWrapError
Warning: "xxx" matches null string many times in regex in Text/Wrap.pm..
I've seen this error message in the past few upgrades (~3.11, .12, .17) and was wondering if anyone else has seen it and knows what the problem is. --- Dec 24 17:32:53 mailhost spamd[3320]: (?:(?<=[\s,]))* matches null string many times in regex; marked by <-- HERE in m/\G(?:(?<=[\s,]))* <-- HERE \Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 47. --- I'm guessing some configuration is messed up somewhere, but I suppose it could be a bug in the Text/Wrap module. I've just checked to see that my cpan modules are up-to-date, and any with version numbers are. Any ideas on getting rid of this message (preferably by removing the cause, not by covering it up...:-)). Thanks, Linda
light-grey listing..? lkml filter probs & catching too much ham.
I'm having problems filtering a list I'm on (lkml). First I had it on normal filtering -- but I had too many false positives. Finally I switched it to a white-list, but now many false negatives (spam) get through. Is there a way to "light-grey" a list -- not a blanket accept-all white-list, but something that temporarily moves the spam "high-water" mark for that specific email: i.e. instead of it taking "X" points to be marked as SPAM, it adds 5 points to the threshold needed to mark the message as spam? I heard that the list owners attempted to tighten the filters and had the same problem -- too many "ham" emails got trapped. Perhaps it is all the code that gets published to that list? Dunno, but something in common between SPAM and, maybe, code (or at least the normal linux-kernel-mailing-list "post") is making it a hard list to "police" ("clean up"). Anyone else have stubborn lists like this, or had success in filtering lkml? I even split off "code-ish" looking posts to a separate folder, but that still didn't stop the false negatives, so I'm not quite sure what makes such a list uniquely difficult to filter. Not the worst problem -- at least it's confined to that folder, but the various spams that are present make it a bit challenging to read -- right in the middle of the tech stuff... Just on the first page of titles (conversations hidden under titles), 2/10 titles are sex-related spams. It's a bit annoying to read through (sigh). Now why would sex-spammers target lkml readers? Do they think lkml readers are uniquely more likely to respond to sex-spam? (Maybe, given the fascination of the average "/." reader and their amusement with "pr0n", there could be some basis to the spammers' methods...?)
thanks, -linda
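One way to fake the 5-point slack with stock rules, sketched below (the match string is an assumption -- check it against the headers of a real lkml message; ALL is SA's pseudo-header that matches against all headers):

  header   LKML_SLACK  ALL =~ /linux-kernel\@vger\.kernel\.org/
  score    LKML_SLACK  -5.0
  describe LKML_SLACK  lkml traffic gets 5 extra points of headroom

A constant -5.0 on list traffic is arithmetically the same as raising the spam threshold by 5 for just those messages, without the blanket accept of a whitelist.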
new problem after upgrading perl module to 3.1.4 (from 3.1.2)
I just updated to a newer version of SpamAssassin a few days ago. Since then I'm getting regular error messages in my spamlog:

Sep 2 03:46:03 Ishtar spamd[13106]: (?:(?<=[\s,]))* matches null string many times in regex; marked by <-- HERE in m/\G(?:(?<=[\s,]))* <-- HERE \Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 46.
Sep 2 03:49:04 Ishtar spamd[13087]: (?:(?<=[\s,]))* matches null string many times in regex; marked by <-- HERE in m/\G(?:(?<=[\s,]))* <-- HERE \Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 46.
Sep 2 03:52:02 Ishtar spamd[13443]: (?:(?<=[\s,]))* matches null string many times in regex; marked by <-- HERE in m/\G(?:(?<=[\s,]))* <-- HERE \Z/ at /usr/lib/perl5/5.8.8/Text/Wrap.pm line 46.
...etc..etc...

Am I missing some needed configuration somewhere, or is the above a problem? It seems to happen with every message. Um... is this like "unsolicited reporting of a bogus condition", and would it fall into syslog-spam? :-)
tnx, Linda
Re: "non-fuzzy body parts in subject": missed
Matt Kettler wrote:
Yes it does... the text of the subject line will match against any body rule. SA pre-pends this so we don't have to have a massive duplication of rules to cover both body and subject.
---
Ah. Didn't know that. Different tools, different lingo for message, message header, message body.
"Want a Bigger MBP?" A '25_replace' rule is present for "fuzzy" MBPs, but doesn't seem to catch unfuzzy ones. So I guess questions might be: 1) should 'fuzzy' rules match non-fuzzy targets as well as fuzzy ones?
IMHO, no. I think there should be two rules with separate scores. In the above example the scores would be pretty much the same.
---
I agree on keeping the rules separate; I just didn't know a fuzzy Subject was included in body.
However, consider the word viagra: an obfuscation is a clear sign of spam. Un-obfuscated, it is a less strong sign of spam, because it could be a joke or a conversation with a medical discussion of some form.
---
Agreed.
Should it, or rather, do people feel this is a good idea?
I don't feel that would be a good idea. Bear in mind this would also make a "good" message (i.e. one at -1.0) be "more good". It just doesn't make sense to me to have something which merely acts as a "score amplifier" instead of a score adjustment.
---
I realized it would increase "goodness" as well, but I guess I didn't see that as much of an issue if the multiplier was applied last.
Performing any kind of GA to establish a reasonable multiplier value for these would be a logistical nightmare.
---
:-) True, but that doesn't mean SA couldn't "support" a post-multiplier! :-) I can see its use would be somewhat limited though, as I'm not sure under what other conditions one would want such scaling, so its loss in "one" circumstance seems minor. Sometimes I get overfocused on a problem and blow up its severity in my mind. Uh, maybe I can blame it on original spam's intent of increasing small problems? ;^? Feedback is good! :-)
Tnx, Linda
"non-fuzzy body parts in subject": missed
I have been receiving a spate of short messages that don't seem to trigger enough default rules to be knocked out. I was investigating and noticed a discrepancy [bug?] in the rules. One particular email refers to the uniquely male body part starting w/"P"; let's call it MBP for purposes of discussion. It gets hit by a '20' rule for body parts in the message body, but I noticed it doesn't get anything for the subject: "Want a Bigger MBP?" A '25_replace' rule is present for "fuzzy" MBPs, but doesn't seem to catch unfuzzy ones. So I guess questions might be:
1) Should 'fuzzy' rules match non-fuzzy targets as well as fuzzy ones?
2) Should there be some "normalization" adjustment for short messages? I'm thinking a "scale factor" rather than an absolute score to add -- reflecting the general idea that short messages are not bad, but if you are scoring on the "bad" side, a multiplier (e.g. 1.1 or 1.2) would increase the score of a message that is already being sized up as "bad".
Does SA support any multiplier-type rules? Should it, or rather, do people feel this is a good idea? i.e.: RULENAME *1.1 (0,*1.1,0,*1.1) type format?
-l
Re: spamcop.net tactics
That doesn't mean it's a moral, an ethical or respectable reason: "Spite" is reason enough for most people these days. Michele Neylon:: Blacknight.ie wrote: if your IPs end up in there it's usually for a reason. Michele
when to SQL; RFE's (to dev?)
Michael Monnerie wrote:
On Samstag, 29. Oktober 2005 06:33 Linda Walsh wrote:
Assuming it is some sort of berkeley db format, what is a good cut-over size as a "rule-of-thumb"... or is there one? What should I expect in speeds for "sa-learn" or spamc? I.e. -- is there a rough guideline for when it becomes more effective to use SQL vs. the Berkeley DB? Or rephrased, when is it worth the effort to convert to SQL and ensure all the SQL software is set up and running?
I don't know whether this really is a performance question; I believe it's more of a "do I need it" question. For example, if you use a system-wide bayes db, you probably won't need SQL. I do this for now.
---
I'm still not sure what size of system (or user) DBs should trigger the use of SQL. Any reason why user DBs would hurt performance over a system DB using the Berkeley format? Supposing I have no system DB and am only using user DBs? What if it is a small group of 3-4 people? Is it an issue of having to read in the DB for each email/user, while the system DB might hang around in memory? Does the system DB get some preferential treatment? I.e. if one user gets 80% of the email, will SA operate as though it is using a system DB? I'm still not sure why "sa-learn" would process emails so much more slowly than 2.6x, since for an individual user it wouldn't be accessing a system DB, no?
But if some users want/need their own bayes or their own settings, it starts becoming easier to use SQL for all those things -- it quickly becomes easier to manage after 5 or so users need their own special config. That's why I'm thinking of switching to SQL. Does anybody know whether MySQL or PostgreSQL is better suited for the job? I prefer PostgreSQL, but many times MySQL is better supported...
mfg zmi
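For reference, the switch itself is mostly configuration (a sketch -- the DSN, username and password are placeholders; see the SQL README that ships with SA for the schema setup):

  # local.cf -- store bayes in SQL instead of Berkeley DB files
  bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
  bayes_sql_dsn       DBI:mysql:sa_bayes:localhost
  bayes_sql_username  sa_user
  bayes_sql_password  sa_pass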
3.1 vs. 2.6x & 3.0x: Good; when to SQL; RFE's (to dev?)
Finally got the kinks worked out in my SA-3.1 setup last week. Filtered out over 420 spams -- maybe 1 false positive, and it was borderline. The speed of sa-learn has dropped, but that may be unavoidable. But I'm finally getting >= the spam recognition I had in 2.63. I have no online tests enabled, as the online test databases are going the way of "cddb"... becoming privatized. Sorta sad... maybe time to start a "freezor" or some similar service. I mean, the spam services collect data about what is spam from the users who use the database. Without the users, they wouldn't be nearly as effective. Yet the users are then encouraged to pay to access the body of data that was previously donated for free. I suppose one could look at the cost of "aggregation" and intelligent processing of 1000's of user-spam inputs into a usable output format, and while that might be manageable for a small community of users, it's not so manageable if the database starts being used by a much larger user-base than the original system was designed for.

Still -- I have yet to look at what is needed to convert my "db"s into SQL form -- been sorta busy: my car got crashed into last week and I was told this week it's totalled; that, and I was informed Tuesday of the need for a root canal, and on Wednesday, of the need for a 2nd root canal & oral surgery. *smile* Life is just so _*!%fun!*%)_. So I'm a bit behind in being on top of my ->SQL conversion. (I'm assuming I'm in an older format; I just ran the convert tool to convert from 2.x format to 3.x.) Assuming it is some sort of berkeley db format, what is a good cut-over size as a "rule-of-thumb"... or is there one? What should I expect in speeds for "sa-learn" or spamc? I.e. -- is there a rough guideline for when it becomes more effective to use SQL vs. the Berkeley DB? Or rephrased, when is it worth the effort to convert to SQL and ensure all the SQL software is set up and running?

Thanks... and thanks for the help/patience.

BTW -- maybe this should go to the "sa-dev" list, but an RFE for "spamassassin --lint":
1) It would be nice to mention whether the daemon is _RUNNING_ and ready to process messages (user error: forgetting to restart the daemon, and seeing no "--lint" message hinting that the daemon isn't running and ready to process incoming mail -- *duh*).
2) It would be nice, especially in "--lint", to check for bogus lock files left around in the spam-DB dir. I don't know when these files are used, but their presence really slows down sa-learn, by about a factor of 4-6x.
And for "sa-learn":
1) RFE: have sa-learn issue a warning about pre-existing lock-files, or, better, auto-remove bogus locks for processes that no longer exist.
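When I do get to the conversion, my understanding is that the dump/reload pair below is the supported path between bayes backends (a sketch; run it as the db's owner):

  # dump the current bayes db (any backend) to a flat file
  sa-learn --backup > bayes.backup
  # after pointing the config at the new backend, reload it
  sa-learn --restore bayes.backup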
Re: SA 3.04: high fail rate; X-SA-no-reject?; more details.
Loren Wilton wrote:
If you are only correctly classifying 50% of the spam (you said 100 caught to 100 missed, I think) then you have SERIOUS problems of some sort.
Yeah, well, I try not to be too reactionary about computer things like this -- especially when it could just be a matter of flipping a config switch somewhere and things instantly get better. While the number of spams getting through is significantly higher, probably 75-80% of them are duplicate emails sent to multiple email addresses -- including some blacklisted To-addresses. Apparently, the spammer isn't being kind enough to send the spam to the black-listed To-Add'ies first, and with the new spamc client, sendmail notices the lower load average and likely allows more parallel incoming instances to process incoming email before a given spam gets "locked out". I suppose this could be a "downside" of the new efficiency, but previous to this I never saw multiple instances of these simple spams get through **undetected**. This makes me think it isn't just the increased efficiency causing problems, as I would have expected at least one or two duplicate spams that wouldn't have been caught by "other means" (than being sent to a blacklisted To-addr).
As a happy 2.63 user that upgraded to 3.04, it took a little minor fiddling, but by and large things are *much* better now, and they were good before.
-- *(oh the salt, the salt [in the wound]... :-) )*
---
Also, you mentioned training with 'old spam' and 'new ham'. Presumably you were talking about bayes training. Really, training with new spam, especially the stuff slipping through, would be the right thing to do. Spam has changed considerably in character in just the last 6 months.
Sorry, I was unclear: I archive current spams after "sa-learn"ing on them, so the "archives" contain anything older than whatever I haven't processed "recently". With SA 2.63, I'd go through my Junk email folder sometimes as infrequently as once a month and find maybe 6-10 emails that should have gone to subscribed lists, or were from recent online vendors that sent me spammy-looking receipts (although those were rare). I'd drop them in my "despam" folder for later "ham learning". But sifted folders of junk email I process (sa-learn-junk) in bulk and archive.
Suggestion: let us see the full list of SA hits on some of the stuff slipping through.
The full list of SA hits? -- for that message, that was it; here's another that passed.
Note, there is a weird header "X-SA-Do-Not-Rej: Yes" which doesn't look normal:

---junk email that passed as ham; sent to multiple email accounts---
Received: (qmail 16547 invoked from network); 16 Sep 2005 18:08:51 -
Received: from unknown (HELO thaimail.org) ([202.150.81.42]) (envelope-sender <[EMAIL PROTECTED]>) by mail7.sea5.speakeasy.net (qmail-ldap-1.03) with SMTP for <[EMAIL PROTECTED]>; 16 Sep 2005 18:08:49 -
From: "Molnar Chris" <[EMAIL PROTECTED]>
To: "Siedler Clemens" <[EMAIL PROTECTED]>
Subject: Re[6]:
Date: Fri, 16 Sep 2005 18:09:04 +
Message-ID: <[EMAIL PROTECTED]>
X-SA-Do-Not-Rej: Yes
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="=_NextPart_000_40CE_1F627A89.B53D40CE"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.2527
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2527
X-Spam-DCC: :
X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on ishtar.sc.tlinx.org
X-Spam-Level: ***
X-Spam-Status: No, score=3.5 required=4.8 tests=BAYES_99,HTML_MESSAGE autolearn=no version=3.0.4
X-Spam-Pyzor:
X-Spam-Report:
 * 0.0 HTML_MESSAGE BODY: HTML included in message
 * 3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
 *     [score: 1.]
Status:
X-Status:
X-Keywords: Junk

In a rejected email, I see many more tests:

---junk email correctly labeled---
Subject: ***SPAM*** Athena, Electric-chair for little or no-cost
MIME-Version: 1.0
X-Mailid: 6977
Content-Type: multipart/alternative; boundary="==8aa9d3a4cb398b"
Date: Thu, 15 Sep 2005 14:56:00 -0700
X-Spam-Prev-Subject: Athena, Electric-chair for little or no-cost
X-Spam-DCC: :
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on ishtar.sc.tlinx.org
X-Spam-Level: **
X-Spam-Status: Yes, score=6.9 required=4.8 tests=BAYES_99,HTML_90_100, HTML_IMAGE_ONLY_20,HTML_IMAGE_RATIO_02,HTML_MESSAGE,HTML_WEB_BUGS, MIME_HTML_MOSTLY,MPART_ALT_DIFF,MSGID_FROM_MTA_HEADER, MSGID_FROM_MTA_ID autolearn=no version=3.0.4
X-Spam-Pyzor:
X-Spam-Report:
 * 1.7 MSGID_FROM_MTA_ID Message-Id for external message added locally
 * 0.4 HTML_IMAGE_ONLY_20 BODY: HTML: images with 1600-2000 bytes of words
 * 0.0 HTML_IMAGE_RATIO_02 BODY: HTML has a low ratio of text to image area
SpamAssassin 3.04 v. 2.6x false negative rate: Help???
Ever since I "upgraded" to the 3.x series, I've had a major jump in spams getting through. Initially my upgrade was to 3.02 as distributed in SuSE 9.3, and my problems were related to old configuration files/options, where NONE of my spam was being tagged into the spam folder (i.e. the SPAM header wasn't set in the subject, which my filtering system makes use of). I've gotten all of the "lint" out of my config files, ported my old DB to the new format, and even run the learning mechanism over several old "SPAM" archives (~150Mb) and current "HAM" input folders (~100Mb). About 100 spams a day are getting through and requiring manual processing, with about 100/day being correctly filtered into the spam folder. That's a huge drop in detected spams. I've tried dialing down the threshold from the default to my previous 5, then to 4.8... not wanting to be overly aggressive. But I'm wondering if the default weightings for various tests have changed between the 2.6x and 3.0x series. I note a new 3.1.0 release, but noticed no improvement going from 3.02 to 3.04. It _seems_ like, maybe, some of the weightings of the various tests changed, which is throwing off the classifier. I'll see multiple instances of various identical spams going to different email addresses on my server -- most often with "Subject: Re[x]:", where x=[0-9]. They are the most numerous offenders, as they'll come in to multiple accounts at nearly the same time (or a few seconds apart). One copy of those messages will result in duplicate spam being sent to several accounts, and my multiple personalities, er, um, "users" :-), are getting annoyed with me. Also of note: "sa-learn" is MUCH slower in 3.0.x than it was in 2.6.x, though with the compiled "spamc" client, I can see that the processing of incoming spam is handled with a lower load on the server. One voice in my head says: screw it, stop your whining and go back to what worked (2.6x); but another part of me says "3.x" is where the future is, and if there is a problem in my setup, I should take the time to figure out what it is and try to make it work.

Looking at a partial header of one such note:

X-Spam-Report:
 * 0.0 HTML_MESSAGE BODY: HTML included in message
 * 3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
 *     [score: 1.]

Content was a multi-part message in MIME format, the same message in plain and HTML text: "" -- Content involved advertising a product for increasing one's chance of producing offspring via chance encounters with receptive female partners. Is 5.0 too high a default in 3.x? Though I would have expected it to count a little bit more for an HTML message... Ooops, another batch of 80+ just came in. SA tastes great, less filling!

re: first posting attempt: <<< 552 spam score (9.1) exceeded threshold -- on a list designed to talk about a tool to detect such spam! And the irony of this restriction may never be known if this note never makes it to the list... ;-/
Sigh, Linda