Re: Those "Re: good obfupills" spams (bayes scores)
From: "Bart Schaefer" <[EMAIL PROTECTED]>

On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:
> In SA 3.1.0 they did force-fix the scores of the bayes rules,
> particularly the high-end. The perceptron assigned BAYES_99 a score of
> 1.89 in the 3.1.0 mass-check run. The devs jacked it up to 3.50. That
> does make me wonder if: 1) When BAYES_9x FPs, it FPs in conjunction
> with lots of other rules due to the ham corpus being polluted with spam.

My recollection is that there was speculation that the BAYES_9x rules were scored "too low" not because they FP'd in conjunction with other rules, but because against the corpus they TRUE P'd in conjunction with lots of other rules, and that it therefore wasn't necessary for the perceptron to assign a high score to BAYES_9x in order to push the total over the 5.0 threshold.

The trouble with that is that users expect training on their personal spam flow to have a more significant effect on the scoring. I want to train bayes to compensate for the LACK of other rules matching, not just to give a final nudge when a bunch of others already hit. I filed a bugzilla some while ago suggesting that the bayes percentage ought to be used to select a rule set, not to adjust the score as a component of a rule set.

<< jdow >> There is one other gotcha. I bet vastly different scores are warranted for Bayes when run with per-user training and rules as compared to global training and rules. {^_^}
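Bart's "compensate for the LACK of other rules matching" idea can be roughly approximated with a meta rule today; a minimal local.cf sketch, where the meta rule name and score are invented for illustration and the choice of Razor/DCC as the "other rules" to check against is just one plausible pairing:

```
# Hypothetical: add weight when BAYES_99 fires but the collaborative
# checksum tests stay quiet, so a well-trained personal Bayes db can
# carry more of the load on spam that nothing else recognizes.
meta     BAYES_99_NEARLY_ALONE  (BAYES_99 && !RAZOR2_CHECK && !DCC_CHECK)
describe BAYES_99_NEARLY_ALONE  BAYES_99 hit with no Razor/DCC support
score    BAYES_99_NEARLY_ALONE  1.5
```

This is the inverse of what the perceptron saw in the corpus: instead of assuming BAYES_99 always arrives with company, it pays out only when it arrives alone.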
Re: Those "Re: good obfupills" spams (bayes scores)
From: "Matt Kettler" <[EMAIL PROTECTED]>

Bart Schaefer wrote:
> On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:
>> Besides.. If you want to make a mathematics based argument against me,
>> start by explaining how the perceptron mathematically is flawed. It
>> assigned the original score based on real-world data.
>
> Did it? I thought the BAYES_* scores have been fixed values for a
> while now, to force the perceptron to adapt the other scores to fit.

Actually, you're right.. I'm shocked and floored, but you're right. In SA 3.1.0 they did force-fix the scores of the bayes rules, particularly the high-end. The perceptron assigned BAYES_99 a score of 1.89 in the 3.1.0 mass-check run. The devs jacked it up to 3.50.

That does make me wonder if:

1) When BAYES_9x FPs, it FPs in conjunction with lots of other rules due to the ham corpus being polluted with spam. This forces the perceptron to attempt to compensate. (Pollution is always a problem since nobody is perfect, but it occurs to differing degrees.)
-or-
2) The perceptron is out of whack. (I highly doubt this because the perceptron generated the scores for 3.0.x and they were fine.)
-or-
3) The real-world FPs of BAYES_99 really do tend to also be cascades with other rules in the 3.1.x ruleset, and the perceptron is correctly capping the score. This could differ from 3.0.x due to changes in rules, or changes in ham patterns over time.
-or-
4) One of the corpus submitters has a poorly trained bayes db. (Possible, but I doubt it.)

Looking at statistics-set3 for 3.0.x and 3.1.x there was a slight increase in ham hits for BAYES_99 and a slight decrease in spam hits.
3.0.x:
  OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
  43.515    89.3888  0.0335  1.000  0.83  1.89   BAYES_99

3.1.x:
  OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
  60.712    86.7351  0.0396  1.000  0.90  3.50   BAYES_99

Also to consider: set3 of 3.0.x was much closer to a 50/50 mix of spam/nonspam (48.7/51.3) than 3.1.0 was (nearly 70/30).

<< jdow >> What happens comes from the basic reality that Bayes and the other rules are not orthogonal sets. So many other rules hit 95 and 99 that the perceptron artificially reduced the goodness rating for these rules. It needs some serious skewing to catch situations where 95 or 99 hit and very few other rules hit. Those are the times the accuracy of Bayes is needed the most. I've found, here, that 5.0 is a suitable score. I suspect if I were more realistic 4.9 would be closer. But I still do remember learning the score bias and being floored by it when I noticed 99 on some spams that leaked through with ONLY the 99 hit. I am speaking of dozens of spams hit that way. So far over several years I've found a few special cases that warrant negative rules. That seems to be pulling the 99 rule's false alarm rate down to "I can't see it." (I have, however, been tempted to generate a BAYES_99p5 rule and a BAYES_99p9 rule to fine-tune the scores up around 4.9 and 5.0.) {^_^}
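The BAYES_99p5/BAYES_99p9 idea maps naturally onto how the stock bayes rules are defined, via the check_bayes eval test over a probability range. A sketch, with made-up rule names and with thresholds and scores that are illustrative rather than measured:

```
# Hypothetical: stock BAYES_99 covers the whole 0.99-1.00 range; these
# rules subdivide its top end so each slice can be scored separately.
body   BAYES_99P5  eval:check_bayes('0.995', '0.999')
body   BAYES_99P9  eval:check_bayes('0.999', '1.00')
tflags BAYES_99P5  learn
tflags BAYES_99P9  learn
score  BAYES_99P5  1.4
score  BAYES_99P9  1.5
```

Note that stock BAYES_99 would still fire alongside these, so the scores add on top of its 3.50; the sample scores above are chosen as the *difference* needed to land near the 4.9/5.0 totals discussed, not as standalone values.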
Re: Those "Re: good obfupills" spams
From: "Matt Kettler" <[EMAIL PROTECTED]>

List Mail User wrote:
> Matt Kettler replied:
>> John Tice wrote:
>>> Greetings, This is my first post after having lurked some. So, I'm
>>> getting these same "RE: good" spams but they're hitting eight rules
>>> and typically scoring between 30 and 40. I'm really unsophisticated
>>> compared to you guys, and it begs the question: what am I doing
>>> wrong? All I use is a tweaked user_prefs wherein I have gradually
>>> raised the scores on standard rules found in spam that slips through
>>> over a period of time. These particular spams are over the top on
>>> bayesian (1.0), have multiple database hits, forged rcvd_helo and so
>>> forth. Bayesian alone flags them for me. I'm trying to understand
>>> the reason you would not want to have these type of rules set high
>>> enough? I must be way over optimized: what am I not getting?
>>
>> BAYES_99, by definition, has a 1% false positive rate.
>
> If we were to presume a uniform distribution between an estimate of
> 99% and 100%, then the FP rate would be .5%, not 1%.

You're right Paul, my bad.. But again, I don't care if it's 0.01%. The question here is "is jacking up the score of BAYES_99 to be greater than required_hits a good idea". The answer is "No", because BAYES_99 is NOT a 100% accurate test. By definition it has a non-zero FP rate.

<< jdow >> I run AT 5.0. When I see my first false alarm solely from BAYES_99 I will reduce it slightly. I know what theory says. I also know that BAYES_99 alone captures more spam than it has ever captured ham for false imprisonment.

> And for large sites (i.e. 10s of thousands of messages a day or more),
> this may be what occurs; But what I see and what I assume many other
> small sites see is a very much non-uniform distribution; From the last
> 30 hours, the average estimate (re. the value reported in the
> "bayes=xxx" clause) for spam hitting the BAYES_99 rule is .41898013269
> with about two thirds of them reporting bayes=1 and a lowest value of
> bayes=0.998721756590216.

Yes, that's to be expected with Chi-Squared combining.

> While SA is quite robust largely because of the design feature that no
> single reason/cause/rule should by itself mark a message as spam, I
> have to guess that the FP rate that the majority of users see for
> BAYES_99 is far below 1%. From the estimators reported above, I would
> expect that I would have seen a .003% FP rate for the last day plus a
> little, if only I received 100,000 or so spam messages to have been
> able to see it :).

True, but it's still not nearly zero. Even in the corpus testing, which is run by "the best of the best" in SA administration and maintenance, BAYES_99 matched 0.0396% of ham, or 21 out of 53,091 hams. (Based on set-3 of SA 3.1.0.)

<< jdow >> And it is scored LESS than BAYES_95 by default. That's a clear signal that the theory behind the scoring system is a little skewed and needs some rethinking.

Given we are dealing with a user who doesn't even understand why you might not want this set "high enough", I would expect the level of sophistication in bayes maintenance to be limited.

Besides.. If you want to make a mathematics-based argument against me, start by explaining how the perceptron mathematically is flawed. It assigned the original score based on real-world data. Not our vast oversimplifications. You should have good reason to question its design before second-guessing its scoring based on speculation such as this.

<< jdow >> When it can give BAYES_99 a score LOWER than BAYES_95 it clearly has a conceptual problem. (It also indicates that automatic Bayes filter training has its own conceptual flaws.)

> I don't change the scoring from the defaults, but if people were to
> want to, maybe they could change the rules (or add a rule) for
> BAYES_99_99 which would take only scores higher than bayes=. and which
> (again with a uniform distribution) have an expected FP rate of .005%
> - then re-score that just closer (but still less) than the spam
> threshold,

I'd agree.. However, the OP has already made BAYES_99 > required_hits. Bad idea. Period.

<< jdow >> 5.0 is, admittedly, marginal. 6 or 7 is not a good idea. Not enough rules exist that will pull it back down. (Thinking on that, I suspect there are some SARE rules that should lower the score slightly when they are not hit.) {^_^}
Re: Tracking Compound Meta's
On Fri, 28 Apr 2006, Dan wrote:
>> It looks like it might have some interesting purposes. But for the
>> most part, I can't think of what you would use it for. I can't think
>> of a single example where SARE could have used this before.
>
> Actually, the way I expect to use it is more like:
>
> __test [A1 - A3]
> __test [B1 - B3]
> __test [C1 - C3]
> __test [D1 - D3]
>
> meta __META_A (__testA1 || __testA2 || __testA3)
>
> [snip..]
>
> Still pretty new to SA, I'm in the middle of building my system and
> was hoping to find preexisting features I could simply build my
> configuration around. If micro weighting (.001) doesn't work, I'll
> make a feature request after deciding the best way to do what I'm
> after. Thinking about it today, my ideal would be:
>
> 1) An option to turn off scoring for specific tests WITHOUT turning
> off its event reporting. Perhaps a different prefix, like ++test
> instead of __test.
>
> AND
>
> 2) A logging system that records EVERY test involved for EVERY message
> scanned, that also allows me to locate the correct entry (with a text
> editor) when all I have is the Subject: or From: of a given message.

What about using the SA 'test rule' mechanism? (I.e., use "T_testA1" rather than "__testA1".) Effectively the micro weighting is done automagically and in a standardized way. Read the SA conf documentation for details.

--
Dave Funk                        University of Iowa
College of Engineering           319/335-5751  FAX: 319/384-0549
1256 Seamans Center              Iowa City, IA 52242-1527
Sys_admin/Postmaster/cell_admin
#include
Better is not better, 'standard' is better. B{
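For reference, the 'test rule' mechanism Dave mentions looks like this in practice; the rule name and pattern here are invented for illustration:

```
# A rule whose name starts with T_ is treated as a test rule: SA gives
# it a default score of 0.01, so it appears in the X-Spam-Status tests
# list (and in logs) without materially moving the message total.
body     T_MY_PROBE  /\breplica watches\b/i
describe T_MY_PROBE  probe rule: logged in hit lists, effectively unscored
```

This gives exactly the "report the hit but don't score it" behavior Dan asked for in his point 1, with no new prefix needed.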
Re: Those "Re: good obfupills" spams (bayes scores)
On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:
> In SA 3.1.0 they did force-fix the scores of the bayes rules,
> particularly the high-end. The perceptron assigned BAYES_99 a score of
> 1.89 in the 3.1.0 mass-check run. The devs jacked it up to 3.50. That
> does make me wonder if: 1) When BAYES_9x FPs, it FPs in conjunction
> with lots of other rules due to the ham corpus being polluted with spam.

My recollection is that there was speculation that the BAYES_9x rules were scored "too low" not because they FP'd in conjunction with other rules, but because against the corpus they TRUE P'd in conjunction with lots of other rules, and that it therefore wasn't necessary for the perceptron to assign a high score to BAYES_9x in order to push the total over the 5.0 threshold.

The trouble with that is that users expect training on their personal spam flow to have a more significant effect on the scoring. I want to train bayes to compensate for the LACK of other rules matching, not just to give a final nudge when a bunch of others already hit. I filed a bugzilla some while ago suggesting that the bayes percentage ought to be used to select a rule set, not to adjust the score as a component of a rule set.
Re: Those "Re: good obfupills" spams (bayes scores)
Bart Schaefer wrote:
> On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:
>> Besides.. If you want to make a mathematics based argument against me,
>> start by explaining how the perceptron mathematically is flawed. It
>> assigned the original score based on real-world data.
>
> Did it? I thought the BAYES_* scores have been fixed values for a
> while now, to force the perceptron to adapt the other scores to fit.

Actually, you're right.. I'm shocked and floored, but you're right. In SA 3.1.0 they did force-fix the scores of the bayes rules, particularly the high-end. The perceptron assigned BAYES_99 a score of 1.89 in the 3.1.0 mass-check run. The devs jacked it up to 3.50.

That does make me wonder if:

1) When BAYES_9x FPs, it FPs in conjunction with lots of other rules due to the ham corpus being polluted with spam. This forces the perceptron to attempt to compensate. (Pollution is always a problem since nobody is perfect, but it occurs to differing degrees.)
-or-
2) The perceptron is out of whack. (I highly doubt this because the perceptron generated the scores for 3.0.x and they were fine.)
-or-
3) The real-world FPs of BAYES_99 really do tend to also be cascades with other rules in the 3.1.x ruleset, and the perceptron is correctly capping the score. This could differ from 3.0.x due to changes in rules, or changes in ham patterns over time.
-or-
4) One of the corpus submitters has a poorly trained bayes db. (Possible, but I doubt it.)

Looking at statistics-set3 for 3.0.x and 3.1.x there was a slight increase in ham hits for BAYES_99 and a slight decrease in spam hits.

3.0.x:
  OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
  43.515    89.3888  0.0335  1.000  0.83  1.89   BAYES_99

3.1.x:
  OVERALL%  SPAM%    HAM%    S/O    RANK  SCORE  NAME
  60.712    86.7351  0.0396  1.000  0.90  3.50   BAYES_99

Also to consider: set3 of 3.0.x was much closer to a 50/50 mix of spam/nonspam (48.7/51.3) than 3.1.0 was (nearly 70/30).
Re: Those "Re: good obfupills" spams
On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:
> Besides.. If you want to make a mathematics based argument against me,
> start by explaining how the perceptron mathematically is flawed. It
> assigned the original score based on real-world data.

Did it? I thought the BAYES_* scores have been fixed values for a while now, to force the perceptron to adapt the other scores to fit.
Re: OT spammers
Igor Chudov wrote: > Here's something that I do not understand. What is the point of > spamming people repeatedly not once, twice, or even 10 times, but > hundreds of times. If I wanted to procure pils, or pgrn, or whatever, > I would have done it on the first 10 spams. After 100 or so spams, > what is the benefit of sending me yet more spam? I seem to receive > some spams, such as about getting fake education, way over 100 times. Because it works. Scary to think that some people are that stupid. david
Re: Those "Re: good obfupills" spams
On 4/29/06, List Mail User <[EMAIL PROTECTED]> wrote:
> While SA is quite robust largely because of the design feature that no
> single reason/cause/rule should by itself mark a message as spam, I
> have to guess that the FP rate that the majority of users see for
> BAYES_99 is far below 1%.
>
> Anyway, to better address the OP's questions: The system is more
> robust if instead of changing the weighting of existing rules
> (assuming that they were correctly established to begin with), you add
> more possible inputs

Exactly. For example, I find that anything in the subset consisting of messages that don't mention my email address anywhere in the To/Cc headers and also scoring above BAYES_70 has close to 100% likelihood of being spam. However, since I also get quite a lot of mail that doesn't fall into that subset, I can't simply increase the scores for the BAYES rules. In this case I use procmail to examine the headers after SA has scored the message, but I've been considering creating a meta-rule of some kind. Trouble is, SA doesn't know what "my email address" means (it'd need to be a list of addresses), and I'm reluctant to turn on allow_user_rules.
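A meta-rule of the kind Bart describes might look roughly like this for a single address; the address, rule names, and score are all placeholders, and a real setup would need the __TO_ME-style subrule extended (or duplicated and OR'd) for each address on the list:

```
# __ prefix = unscored subrule; the ToCc pseudo-header matches both
# the To: and Cc: headers in one pattern.
header __TO_ME              ToCc =~ /\bme\@example\.com\b/i
# BAYES_70..BAYES_99 are disjoint ranges, so OR-ing them means
# "bayes probability is 0.70 or higher".
meta   NOT_TO_ME_HIGH_BAYES (!__TO_ME && (BAYES_70 || BAYES_80 || BAYES_95 || BAYES_99))
score  NOT_TO_ME_HIGH_BAYES 2.0
```

Placed in a site-wide local.cf this avoids allow_user_rules entirely, though it only works where one user's address list can be shared by the whole site.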
Re: SA & Razor problem - help requested
On Sat, Apr 29, 2006 at 01:07:28PM -0400, Theo Van Dinter wrote:
> On Sat, Apr 29, 2006 at 06:16:36PM +0200, Rainer Sokoll wrote:
> > loadplugin Mail::SpamAssassin::Plugin::Razor2
>
> don't do that in a cf file..

Moved to init.pre.

> What does the output from:
>
> spamassassin --lint -D razor2
>
> look like?

dbg: razor2: razor2 is available, version 2.81

Nothing else :-(

Rainer
Re: SA & Razor problem - help requested
On Sat, Apr 29, 2006 at 06:16:36PM +0200, Rainer Sokoll wrote:
> loadplugin Mail::SpamAssassin::Plugin::Razor2

don't do that in a cf file..

> Any suggestions?

What does the output from:

spamassassin --lint -D razor2

look like?

--
Randomly Generated Tagline:
"What is a lie but the truth in masquerade." - Byron
KMail and spamassassin question
Hi,

I run Gentoo Linux and KDE 3.5.2 with KMail. Currently I have configured and installed SpamAssassin version 3.1.0. I configured SA to run as a daemon against KMail, running as a plug-in, so any time I receive mail through KMail, SA filters all of it. I have a few questions regarding how SA filters mail and how it integrates with KMail. I would appreciate some explanations if possible; I already asked on the KMail mailing list but they were not able to provide me with answers.

1) After I installed SA, is it advised to install ready-made rules, like spamassassin-ruledujour, and more like them? If so, how to implement them? Just install them and that is all?
2) I have some recurrent spam mail that is not filtered even after I train SA on it. What to do?
3) Are there different ways of setting up SA and KMail?
4) How can I create a whitelist with SA and KMail where I can see a list of all my whitelist members?
5) Any good book to purchase to learn SA properly?

Thank you,
Spiro
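On question 2 above, recurring spam that survives training usually just needs the Bayes db fed more explicitly with sa-learn; a typical invocation, where the folder paths are placeholders for wherever KMail keeps your sorted mail:

```
sa-learn --spam ~/Mail/spam/     # learn a folder of hand-sorted spam
sa-learn --ham  ~/Mail/inbox/    # balance it with known-good mail
sa-learn --dump magic            # check nspam/nham counters: by default
                                 # Bayes needs 200 of each before it
                                 # scores anything at all
```

If the nspam or nham counter is below its minimum, the BAYES_* rules stay silent no matter how obvious the spam looks, which is a common reason training seems to have no effect.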
Re: SA & Razor problem - help requested
On Sat, Apr 29, 2006 at 10:39:48AM -0400, Theo Van Dinter wrote:
> the third thing in the UPGRADE doc:
>
> - Due to license restrictions the DCC and Razor2 plugins are disabled
>   by default. [...]

OK, in my local.cf I have:

loadplugin Mail::SpamAssassin::Plugin::Razor2

ifplugin Mail::SpamAssassin::Plugin::Razor2
  use_razor2 1
  razor_config /home/vscan/.razor/razor-agent.conf
endif

A test gives me:

[EMAIL PROTECTED]:~> /usr/local/perl-5.8.8/bin/spamassassin -D --lint \
  --config-file=/tmp/local.cf 2>&1 | grep -i razor
[25516] dbg: diag: module installed: Razor2::Client::Agent, version 2.81
[25516] dbg: plugin: loading Mail::SpamAssassin::Plugin::Razor2 from @INC
[25516] dbg: razor2: razor2 is available, version 2.81
[25516] dbg: plugin: registered Mail::SpamAssassin::Plugin::Razor2=HASH(0x90f8dec)
[25516] dbg: plugin: loading Mail::SpamAssassin::Plugin::Razor2 from @INC
[25516] dbg: razor2: razor2 is available, version 2.81
[25516] dbg: plugin: did not register Mail::SpamAssassin::Plugin::Razor2=HASH(0x8fc5f1c), already registered
[EMAIL PROTECTED]:~>

Besides the fact that Razor2 seems to be loaded twice: shouldn't I expect to see something similar to http://wiki.apache.org/spamassassin/RazorHowToTell? If I do a razor-check manually, razor seems to work fine. Any suggestions?

Thank you,
Rainer
Re: SA & Razor problem - help requested
Theo,

Thanks for this. Now I feel stupid for bothering the list. I have been running SA for some time and didn't notice that change. My bad.

Thanks for the quick reply!

Dave

On Sat, 29 Apr 2006 10:39:48 -0400, Theo Van Dinter wrote
> On Sat, Apr 29, 2006 at 08:58:42AM -0400, David Flanigan wrote:
> > (http://www.flanigan.net/spam) seen even a single RAZOR hit. However,
> > I get no errors in the error logs. The only error I see is on a
> > `spamassassin lint` which says:
> >
> > [8611] warn: config: failed to parse line, skipping: use_razor2__1
> > [8611] warn: config: failed to parse line, skipping: razor_config
> > __/etc/mail/spamassassin/.razor/razor-agent.conf
>
> enable the razor plugin in v310.pre.
>
> > Oddly, I get the exact same symptoms with DCC.
>
> ditto.
>
> > I have searched the mailing list, and I followed the wiki guide at
> > spamassassin.apache.org for installing razor with SA. I have verified
> > that both SA and Razor work on their own, and have fed razor several
> > test messages. SA works fine other than the razor problems.
>
> the third thing in the UPGRADE doc:
>
> - Due to license restrictions the DCC and Razor2 plugins are disabled
>   by default. [...]
>
> --
> Randomly Generated Tagline:
> "... then you'll excuse me, but I'm in the middle of fifteen things,
> all of them annoying."
> - Ivonova, Babylon 5 (Midnight on the Firing Line)

---
Kind Regards,
David
http://www.flanigan.net
Re: SA & Razor problem - help requested
On Sat, Apr 29, 2006 at 08:58:42AM -0400, David Flanigan wrote:
> (http://www.flanigan.net/spam) seen even a single RAZOR hit. However,
> I get no errors in the error logs. The only error I see is on a
> `spamassassin lint` which says:
>
> [8611] warn: config: failed to parse line, skipping: use_razor2__1
> [8611] warn: config: failed to parse line, skipping: razor_config
> __/etc/mail/spamassassin/.razor/razor-agent.conf

enable the razor plugin in v310.pre.

> Oddly, I get the exact same symptoms with DCC.

ditto.

> I have searched the mailing list, and I followed the wiki guide at
> spamassassin.apache.org for installing razor with SA. I have verified
> that both SA and Razor work on their own, and have fed razor several
> test messages. SA works fine other than the razor problems.

the third thing in the UPGRADE doc:

- Due to license restrictions the DCC and Razor2 plugins are disabled
  by default. [...]

--
Randomly Generated Tagline:
"... then you'll excuse me, but I'm in the middle of fifteen things, all of them annoying."
- Ivonova, Babylon 5 (Midnight on the Firing Line)
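Concretely, the fix Theo points at is uncommenting the relevant loadplugin lines in v310.pre; the path shown is the usual default and may differ on a custom install:

```
# /etc/mail/spamassassin/v310.pre -- these ship commented out in 3.1
# because of the license restrictions mentioned in the UPGRADE doc;
# uncomment to enable the plugins.
loadplugin Mail::SpamAssassin::Plugin::DCC
loadplugin Mail::SpamAssassin::Plugin::Razor2
```

After editing, `spamassassin --lint -D razor2` should show the plugin being loaded and registered, and the `use_razor2` / `razor_config` config lines will parse instead of being skipped.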
Re: Those "Re: good obfupills" spams
Thank you all for the comments. My personal experience is that BAYES_99 is amazingly reliable: close to 100% for me. I formerly had it set to 4.5 so that BAYES_99 plus one other hit would flag it, but then I started getting some spam that were not hit by any other rule, yet bayes correctly identified them. It seems more effective to write some negative-scoring ham rules specific to my important content rather than to take less than full advantage of the high accuracy of bayes. And the spams in question in this thread are hitting multiple rules, so they should be catchable without having BAYES_99 set over the top. I suppose all these judgments must take into account one's preferences, degree of aversion to FPs, and the diversity of content you're working with. Hopefully I will improve accuracy by writing/adding custom rules and be able to back off the scoring of standard rules, but I have been fairly successful (by my own definition) at tweaking standard rules with minimal FPs. At times when I do get an FP I take a look at it and think "this one just deserves to get filtered." I'm willing to accept a certain amount, or a certain type, in order to be aggressive against spam. Before, I only had access to user_prefs, but now that I have a server with root access it's a brand new ball game. The mechanics are easy enough, but I need to work on the broader strategies. Any particularly good reading to be recommended?

John

On Apr 29, 2006, at 8:12 AM, List Mail User wrote:
> ...
>
> Matt Kettler replied:
>> John Tice wrote:
>>> Greetings, This is my first post after having lurked some. So, I'm
>>> getting these same "RE: good" spams but they're hitting eight rules
>>> and typically scoring between 30 and 40. I'm really unsophisticated
>>> compared to you guys, and it begs the question: what am I doing
>>> wrong? All I use is a tweaked user_prefs wherein I have gradually
>>> raised the scores on standard rules found in spam that slips through
>>> over a period of time. These particular spams are over the top on
>>> bayesian (1.0), have multiple database hits, forged rcvd_helo and so
>>> forth. Bayesian alone flags them for me. I'm trying to understand
>>> the reason you would not want to have these type of rules set high
>>> enough? I must be way over optimized: what am I not getting?
>>
>> BAYES_99, by definition, has a 1% false positive rate.
>
> Matt,
>
> If we were to presume a uniform distribution between an estimate of
> 99% and 100%, then the FP rate would be .5%, not 1%. And for large
> sites (i.e. 10s of thousands of messages a day or more), this may be
> what occurs; But what I see and what I assume many other small sites
> see is a very much non-uniform distribution; From the last 30 hours,
> the average estimate (re. the value reported in the "bayes=xxx"
> clause) for spam hitting the BAYES_99 rule is .41898013269 with about
> two thirds of them reporting bayes=1 and a lowest value of
> bayes=0.998721756590216.
>
> While SA is quite robust largely because of the design feature that no
> single reason/cause/rule should by itself mark a message as spam, I
> have to guess that the FP rate that the majority of users see for
> BAYES_99 is far below 1%. From the estimators reported above, I would
> expect that I would have seen a .003% FP rate for the last day plus a
> little, if only I received 100,000 or so spam messages to have been
> able to see it :).
>
> I don't change the scoring from the defaults, but if people were to
> want to, maybe they could change the rules (or add a rule) for
> BAYES_99_99 which would take only scores higher than bayes=. and which
> (again with a uniform distribution) have an expected FP rate of .005%
> - then re-score that just closer (but still less) than the spam
> threshold, or add a point or fraction thereof to raise the score to
> just under the spam threshold (adding a new rule would avoid having to
> edit distributed files and thus would probably be the "better" method).
>
> Anyway, to better address the OP's questions: The system is more
> robust if instead of changing the weighting of existing rules
> (assuming that they were correctly established to begin with), you add
> more possible inputs (and preferably independent ones - i.e. where the
> FPs between rules have a low correlation). Simply increasing scores
> will improve your spam "capture" rate, just as decreasing the spam
> threshold will - but both methods will add to the likelihood of false
> positives; Look into the distributed documentation to see the expected
> FP rates at different spam threshold levels for numbers to drive this
> point home (and changing specific rules' scores is just like changing
> the threshold, but in a non-uniform fashion - unless you actually
> measure the values for your own site's mail and recompute numbers that
> are a better estimate for local traffic).
>
> Paul Shupak
> [EMAIL PROTECTED]
Re: SQLite
Jonas Eckerman wrote:
> Jakob Hirsch wrote:
>
>> I don't think SQLite itself is _that_ slow (in fact, I don't think
>> it's slow at all), it's most probably a matter of optimization,
>
> SQLite *can* be very slow at some inserts/updates on some systems
> because of how it handles writes. SQLite creates a temporary file for
> each write operation, and also waits for writes to be safely finished
> by the OS.
>
> If speed is more important than database consistency, the SQL command
> 'PRAGMA SYNCHRONOUS=OFF' makes SQLite a *lot* faster. It simply tells
> SQLite not to wait for every write to be finished.

Yep, I've tried this.

> On a stable system with working backup routines, running SQLite with
> 'PRAGMA SYNCHRONOUS=OFF' for bayes makes a lot of sense.
>
> Is there any easy way to tell SpamAssassin's SQL initialization to run
> specific commands directly after opening a database connection? Or
> would it make more sense creating a
> 'Mail::SpamAssassin::BayesStore::SQLite' that does this if told to?

It has been a while, but I believe you just need to do this at create time, so you'd only need a proper .sql file that did it. If you look in the "Attic" or whatever they call it in subversion, you'll see that there used to exist SQLite files.

I believe a custom plugin would need to be created to make use of the transactional capabilities. However, I've done the work in the past and discovered it just was not worth it; you were better off sticking with Berkeley DB or the MUCH faster SDBM. That said, that doesn't mean that I wouldn't welcome a contribution from someone who went off and did the work, so feel free to create the module and do the testing. Submit a bug with the code and results attached and I will strongly consider adding it to the source tree.

> (I'm moving stuff into a SQLite database in a MIMEDefang filter, so
> I'm thinking of trying it out for bayes as well...)
>> If time permits, I'll do a benchmark run, anyway,
>
> Are there any ready-made benchmark scripts for the bayes stuff?

As Matt said, you can find it here:
http://wiki.apache.org/spamassassin/BayesBenchmark

The actual benchmark code is here:
http://wiki.apache.org/spamassassin-data/attachments/BayesBenchmark/attachments/benchmark.tar.gz

I think that I've added enough documentation to get you up and running, but if you have questions, feel free to ask. Improvements to the benchmark are also more than welcome.

Thanks
Michael
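One caveat worth keeping in mind with the create-time .sql approach discussed above: SQLite's synchronous setting is per-connection and is not persisted in the database file, so a line like the following in a schema file only affects the connection that runs the script:

```
-- Speed/durability trade-off: tell SQLite not to wait for the OS to
-- finish each write. This PRAGMA must be re-issued on every new
-- connection; it does not stick to the database file.
PRAGMA synchronous = OFF;
```

That per-connection scope is presumably why a dedicated BayesStore::SQLite module that issues the PRAGMA after each connect, as Jonas suggests, may still be the more reliable route.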
Re: Those "Re: good obfupills" spams
List Mail User wrote:
>> ...
>
> Matt Kettler replied:
>
>> John Tice wrote:
>>
>>> Greetings,
>>> This is my first post after having lurked some. So, I'm getting these
>>> same "RE: good" spams but they're hitting eight rules and typically
>>> scoring between 30 and 40. I'm really unsophisticated compared to you
>>> guys, and it begs the question: what am I doing wrong? All I use is a
>>> tweaked user_prefs wherein I have gradually raised the scores on
>>> standard rules found in spam that slips through over a period of time.
>>> These particular spams are over the top on bayesian (1.0), have
>>> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
>>> flags them for me. I'm trying to understand the reason you would not
>>> want to have these type of rules set high enough? I must be way over
>>> optimized: what am I not getting?
>>>
>> BAYES_99, by definition, has a 1% false positive rate.
>>
> Matt,
>
> If we were to presume a uniform distribution between an estimate of
> 99% and 100%, then the FP rate would be .5%, not 1%.

You're right Paul, my bad.. But again, I don't care if it's 0.01%. The question here is "is jacking up the score of BAYES_99 to be greater than required_hits a good idea". The answer is "No", because BAYES_99 is NOT a 100% accurate test. By definition it has a non-zero FP rate.

> And for large sites (i.e. 10s of thousands of messages a day or more),
> this may be what occurs; But what I see and what I assume many other
> small sites see is a very much non-uniform distribution; From the last
> 30 hours, the average estimate (re. the value reported in the
> "bayes=xxx" clause) for spam hitting the BAYES_99 rule is .41898013269
> with about two thirds of them reporting bayes=1 and a lowest value of
> bayes=0.998721756590216.

Yes, that's to be expected with Chi-Squared combining.

> While SA is quite robust largely because of the design feature that no
> single reason/cause/rule should by itself mark a message as spam, I
> have to guess that the FP rate that the majority of users see for
> BAYES_99 is far below 1%. From the estimators reported above, I would
> expect that I would have seen a .003% FP rate for the last day plus a
> little, if only I received 100,000 or so spam messages to have been
> able to see it :).

True, but it's still not nearly zero. Even in the corpus testing, which is run by "the best of the best" in SA administration and maintenance, BAYES_99 matched 0.0396% of ham, or 21 out of 53,091 hams. (Based on set-3 of SA 3.1.0.)

Given we are dealing with a user who doesn't even understand why you might not want this set "high enough", I would expect the level of sophistication in bayes maintenance to be limited.

Besides.. If you want to make a mathematics-based argument against me, start by explaining how the perceptron mathematically is flawed. It assigned the original score based on real-world data. Not our vast oversimplifications. You should have good reason to question its design before second-guessing its scoring based on speculation such as this.

> I don't change the scoring from the defaults, but if people were to
> want to, maybe they could change the rules (or add a rule) for
> BAYES_99_99 which would take only scores higher than bayes=. and which
> (again with a uniform distribution) have an expected FP rate of .005%
> - then re-score that just closer (but still less) than the spam
> threshold,

I'd agree.. However, the OP has already made BAYES_99 > required_hits. Bad idea. Period.
RE: Bayes troubles
OK... I did the greps you recommended and didn't find any use_dcc
lines... I even did:

grep use_dcc /home/sites/*/users/*/.spamassassin/user_prefs

and still didn't find anything (checking all user directories).
(Actually, my running SA build is in /home/spam-filter... (bin, share,
etc.) - I'm on a Cobalt RAQ and can't upgrade the primary Perl to Perl
5.8, so I made a little subsystem.)

I found the report_contact flag in the 10_misc.cf in both
/usr/share/spamassassin and /home/spam-filter/share/spamassassin. I have
an old build in /usr/share/spamassassin that I need to delete (thanks
for reminding me). I think I'll hold out until v3.1.2 is released since,
according to traffic here, it is fairly close. (Maybe I'll download and
install the latest Razor and be razoring as well now.)

Thanks Matt!

--Will

-----Original Message-----
From: Matt Kettler [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 28, 2006 11:40 AM
To: Will Nordmeyer
Cc: users@spamassassin.apache.org
Subject: Re: Bayes troubles

Will Nordmeyer wrote:
> Matt,
>
> I ran lint this AM (I frequently forget that part :-)), and only had 2
> issues -
>
> warn: config: failed to parse line, skipping: use_dcc 1
> warn: config: warning: score set for non-existent rule RAZOR2_CHECK
>
> I can't find where the use_dcc or the RAZOR2_CHECK are set though.
> None of the .cf files in /etc/mail/spamassassin have them.

Perhaps a user_prefs has them. Or if you have "inherited" a system,
someone edited the /usr/share/ files? Or maybe someone put it in a .pre
file in /etc/mail/spamassassin?

grep use_dcc /usr/share/spamassassin/*.cf
grep use_dcc /etc/mail/spamassassin/*.cf
grep use_dcc /etc/mail/spamassassin/*.pre
grep use_dcc ~/.spamassassin/user_prefs

> I tried running the spamassassin --lint --debug and dump the dbg
> output to a file, but apparently I'm screwing up the redirect because
> my output file always is empty.

You can't redirect the debug output with > or |. It is output to stderr,
not stdout.
In bash-type shells you can redirect stderr using 2> instead of >.

> I'm running via spamd and have restarted spamd. By the way, I'm
> running v3.1.1 (and for some reason it puts @@CONTACT_ADDRESS@@ in the
> emails saying that spam detection software running on blah blah blah -
> know how I can easily fix that without having to rebuild?).

That makes me fairly concerned about the integrity of the build. I'd
strongly suggest rebuilding anyway. That said, you can edit
/usr/share/spamassassin/10_misc.cf and edit the report_contact option
there. Be VERY careful editing this, and be sure to lint afterwards.

Note: in the general case I would advise against editing any of the .cf
files in /usr/share/spamassassin. They will all be obliterated and
re-written if you upgrade or re-install. In this case, that's perfectly
fine.
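A minimal sketch of the redirection point, using a stand-in `demo` function in place of `spamassassin --lint --debug` so the example is self-contained:

```shell
# The debug output goes to stderr, not stdout, so a plain '>' captures
# nothing. 'demo' stands in for any command that writes to both streams.
demo() { echo "to stdout"; echo "to stderr" >&2; }

demo >  out.txt          # stdout only; debug text would NOT land here
demo 2> err.txt          # stderr only; this is the one you want
demo >  both.txt 2>&1    # both streams merged into one file

cat err.txt              # prints: to stderr
```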
SA & Razor problem - help requested
Hello Spamassassins,

I am having an odd problem; I was hoping for some insight from those
more adept than I. I am trying to get Razor working with SpamAssassin,
to little effect. To put it simply, SA never uses Razor, and I have
never in thousands of messages (http://www.flanigan.net/spam) seen even
a single Razor hit. However, I get no errors in the error logs. The only
error I see is on a `spamassassin --lint`, which says:

[8611] warn: config: failed to parse line, skipping: use_razor2__1
[8611] warn: config: failed to parse line, skipping: razor_config __/etc/mail/spamassassin/.razor/razor-agent.conf

Oddly, I get the exact same symptoms with DCC. I have compiled SA from
scratch and installed it over the existing install just to make sure. I
have searched the mailing list, and I followed the wiki guide at
spamassassin.apache.org for installing Razor with SA. I have verified
that both SA and Razor work on their own, and have fed Razor several
test messages. SA works fine other than the Razor problems.

My config:

1. I am invoking spamc 3.1.1 through /etc/procmailrc using a simple:

:0fw
| /usr/bin/spamc

2. spamd is called with the following args: -u spamd -d -x -m5 -H /etc/mail/spamassassin/ -r /var/run/spamd.pid

3. I am running version 2.8.1 of the Razor clients.

4. I am running the above on Linux Fedora Core 5 (kernel 2.6.16-1.2096_FC5).

5. My local.cf has the following lines for Razor:

use_razor2 1
razor_config/etc/mail/spamassassin/.razor/razor-agent.conf

Any advice you could offer would be greatly appreciated!

---
Kind Regards,
David
http://www.flanigan.net
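One possibility worth ruling out (an assumption on my part, suggested by the doubled underscores in the lint output): invisible non-ASCII characters, such as non-breaking spaces pasted from a web page, will make those config lines unparseable. A quick way to check, sketched here against a simulated scratch file rather than the real local.cf:

```shell
# Simulate a local.cf where a non-breaking space (U+00A0, bytes \302\240)
# was pasted between the option and its value. 'local.cf.sample' is a
# scratch file, not the real config:
printf 'use_razor2\302\2401\n' > local.cf.sample
printf 'razor_config /etc/mail/spamassassin/.razor/razor-agent.conf\n' >> local.cf.sample

# Any line containing bytes outside plain printable ASCII shows up here;
# a clean config produces no output:
LC_ALL=C grep -n '[^[:print:][:space:]]' local.cf.sample
```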
Re: Those "Re: good obfupills" spams
Matt Kettler replied:

> John Tice wrote:
>> Greetings,
>> This is my first post after having lurked some. So, I'm getting these
>> same "RE: good" spams but they're hitting eight rules and typically
>> scoring between 30 and 40. I'm really unsophisticated compared to you
>> guys, and it begs the question: what am I doing wrong? All I use is a
>> tweaked user_prefs wherein I have gradually raised the scores on
>> standard rules found in spam that slips through over a period of time.
>> These particular spams are over the top on bayesian (1.0), have
>> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
>> flags them for me. I'm trying to understand the reason you would not
>> want to have these types of rules set high enough? I must be way over
>> optimized: what am I not getting?
>
> BAYES_99, by definition, has a 1% false positive rate.

Matt,

If we were to presume a uniform distribution between an estimate of 99%
and 100%, then the FP rate would be .5%, not 1%. And for large sites
(i.e. tens of thousands of messages a day or more), this may be what
occurs; but what I see, and what I assume many other small sites see, is
a very much non-uniform distribution. From the last 30 hours, the
average estimate (i.e. the value reported in the "bayes=xxx" clause) for
spam hitting the BAYES_99 rule is .41898013269, with about two thirds of
them reporting bayes=1 and a lowest value of bayes=0.998721756590216.

While SA is quite robust, largely because of the design feature that no
single reason/cause/rule should by itself mark a message as spam, I have
to guess that the FP rate that the majority of users see for BAYES_99 is
far below 1%. From the estimators reported above, I would expect that I
would have seen a .003% FP rate for the last day plus a little, if only
I received 100,000 or so spam messages to have been able to see it :).
I don't change the scoring from the defaults, but if people were to want
to, maybe they could change the rules (or add a rule) for BAYES_99_99
which would take only scores higher than bayes=. and which (again with a
uniform distribution) would have an expected FP rate of .005% - then
re-score that just closer to (but still less than) the spam threshold,
or add a point or fraction thereof to raise the score to just under the
spam threshold (adding a new rule would avoid having to edit distributed
files and thus would probably be the "better" method).

Anyway, to better address the OP's questions: the system is more robust
if, instead of changing the weighting of existing rules (assuming that
they were correctly established to begin with), you add more possible
inputs (and preferably independent ones - i.e. ones where the FPs
between rules have a low correlation). Simply increasing scores will
improve your spam "capture" rate, just as decreasing the spam threshold
will - but both methods will add to the likelihood of false positives.
Look into the distributed documentation to see the expected FP rates at
different spam threshold levels for numbers to drive this point home
(changing specific rules' scores is just like changing the threshold,
but in a non-uniform fashion - unless you actually measure the values
for your own site's mail and recompute numbers that are a better
estimate for local traffic).

Paul Shupak
[EMAIL PROTECTED]
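If someone did want to experiment with such a BAYES_99_99 sub-rule, a sketch of what it might look like in local.cf (the cutoff and score are illustrative guesses, not measured values; this assumes the stock check_bayes eval that the distributed BAYES_* rules use):

```
# Hypothetical sub-rule: fires only on extremely confident Bayes scores.
body     BAYES_99_99  eval:check_bayes('0.9999', '1.00')
tflags   BAYES_99_99  learn
describe BAYES_99_99  Bayes spam probability is 99.99 to 100%
# Illustrative score: adds a nudge without by itself crossing required_hits.
score    BAYES_99_99  1.5
```

Because it is an added rule, it stacks on top of the stock BAYES_99 score rather than replacing it, which is the "better method" Paul describes.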
Re: span float obfuscation
Kenneth-san, thank you for your kind advice.
I've posted new rules to Bugzilla. But it's a little bit difficult for
me. ^^;

BTW, I have more rules for catching various types of spam. Which is
better for posting new rules?
(1) first posting new rules to this users ML, then posting to Bugzilla
(2) directly posting new rules to Bugzilla

From: Kenneth Porter <[EMAIL PROTECTED]>
Subject: Re: span float obfuscation
Date: Fri, 28 Apr 2006 10:05:56 -0700

> On Saturday, April 29, 2006 1:48 AM +0900 MATSUDA Yoh-ichi
> <[EMAIL PROTECTED]> wrote:
>
>> May I post my rules to Bugzilla?
>
> Sounds good to me. I would have done so myself but wanted to make sure
> you get attribution. You'll probably want to subscribe to the -devel
> list as all bugzilla traffic goes through there. And as the wiki page
> recommends, attach a sample spam to illustrate what the rule is
> supposed to catch.
>
> Once the rule is captured in bugzilla, a dev can get it into the
> automated testing sandbox and we can see how the rule performs on
> their corpora.

--
Nothing but a peace sign.
MATSUDA Yoh-ichi(yoh)
mailto:[EMAIL PROTECTED]
http://www.flcl.org/~yoh/diary/ (only Japanese)
Re: Those "Re: good obfupills" spams
From: "Loren Wilton" <[EMAIL PROTECTED]>

>> This is my first post after having lurked some. So, I'm getting these
>> same "RE: good" spams but they're hitting eight rules and typically
>> scoring between 30 and 40. I'm really unsophisticated compared to you
>> guys, and it begs the question: what am I doing wrong? All I use is a
>> tweaked user_prefs wherein I have gradually raised the scores on
>> standard rules found in spam that slips through over a period of time.
>> These particular spams are over the top on bayesian (1.0), have
>> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
>> flags them for me. I'm trying to understand the reason you would not
>> want to have these types of rules set high enough? I must be way over
>> optimized: what am I not getting?
>
> The danger with tweaking standard rule scores you probably already
> know: you are at least theoretically likely to get more false
> positives, because the score set was optimized for the original
> scores. Of course, everyone tweaks a few scores at least. After all,
> that is why they are tweakable. As long as you watch your spam bucket
> for FPs you can go pretty high on things.
>
> Looking at today's spam I only see one of these, but it scored around
> 30. I have a bunch of the "Re: news" kind that all scored 35-39.
>
> Loren

And most of those which are not blacklists are from 88_FVGT_body.cf.

{^_^} Joanne
Re: Those "Re: good obfupills" spams
> This is my first post after having lurked some. So, I'm getting these
> same "RE: good" spams but they're hitting eight rules and typically
> scoring between 30 and 40. I'm really unsophisticated compared to you
> guys, and it begs the question: what am I doing wrong? All I use is a
> tweaked user_prefs wherein I have gradually raised the scores on
> standard rules found in spam that slips through over a period of time.
> These particular spams are over the top on bayesian (1.0), have
> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
> flags them for me. I'm trying to understand the reason you would not
> want to have these types of rules set high enough? I must be way over
> optimized: what am I not getting?

The danger with tweaking standard rule scores you probably already know:
you are at least theoretically likely to get more false positives,
because the score set was optimized for the original scores. Of course,
everyone tweaks a few scores at least. After all, that is why they are
tweakable. As long as you watch your spam bucket for FPs you can go
pretty high on things.

Looking at today's spam I only see one of these, but it scored around
30. I have a bunch of the "Re: news" kind that all scored 35-39.

Loren
Re: Those "Re: good obfupills" spams
From: "Matt Kettler" <[EMAIL PROTECTED]>

> jdow wrote:
>>> BAYES_99, by definition, has a 1% false positive rate.
>>
>> That is what Bayes thinks. I think it is closer to something between
>> 0.5% and 0.1% false positive. I have mine trained down lethally fine
>> at this point, it appears.
>
> Ok, fine, let's take a 0.1% FP rate, 10x better than theoretical, but
> still realistic at some sites. Even still, is that low enough to be
> worth assigning >5.0 points to? No.

So far, however, it has been worth 5.0 points. I've had it (actually)
false positive maybe once in the last month. I've had SA mismark some
BAYES_99 spam, however: the spam had other characteristics that earned a
slight negative score. (I've since developed some meta rules that are
reducing this. If the email is from a mailing list I know, I give it a
modest negative score. Then if the Bayes is high or very high I award
some positive points. High plus mailing list is about 2 points, with
mailing list being -1.5. Very high adds another 2 points. That second
two points MAY have to be fine-tuned upwards.)

{^_^}
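As a rough illustration of the meta-rule scheme described here (the rule names and the List-Id pattern are hypothetical placeholders, and the scores follow the numbers in the message rather than anything tested):

```
# Known mailing list earns a modest negative score.
header LOCAL_KNOWN_LIST  List-Id =~ /lists\.example\.com/
score  LOCAL_KNOWN_LIST  -1.5

# High-or-better Bayes on a known list claws roughly 2 points back
# (net +0.5 against the -1.5 list bonus).
meta   LOCAL_LIST_BAYHI  LOCAL_KNOWN_LIST && (BAYES_95 || BAYES_99)
score  LOCAL_LIST_BAYHI  2.0

# Very high Bayes adds another 2 points (may need tuning upward).
meta   LOCAL_LIST_BAY99  LOCAL_KNOWN_LIST && BAYES_99
score  LOCAL_LIST_BAY99  2.0
```

Since BAYES_99 satisfies both metas, a very-high-Bayes message on a known list gains about 4 points net of the list bonus, matching the stacking jdow describes.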