Re: New Bayes like paradigm
On 10/13, Adam Katz wrote:
> PS: As an SA Committer, do I have access to those logs?

Don't think so, but you can just ask for a regular masscheck account
if you don't already have one, and with that account do:

rsync --exclude '*~' -vaz "rsync.spamassassin.org::corpus" ./

--
"I'd rather be happy than right any day."
 - Slartibartfast, The Hitchhiker's Guide to the Galaxy
http://www.ChaosReigns.com
Re: New Bayes like paradigm
> On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
>> You definitely have a good point that it would only be necessary to
>> track the combinations that actually show up in emails, however
>> 1024 is only the possible combinations from one set of 10 rules.
>> The number of combinations in the actual corpora would be much
>> higher. I'll try to get you a number.

On 10/10/2011 06:55 AM, Marc Perkel wrote:
> You wouldn't have to store all combinations. You could just do up to
> 3 levels and only the combinations that actually occur and use a hash
> to look up the combinations.

The data is all there if you have access to the spam.log and ham.log
files created by mass-check (warning, this code was composed in email,
not vim, and it has not been run):

#!/bin/sh
# Give three rule names as arguments.  Assumes mass-check's ham.log
# and spam.log are in the current directory.
tp=$(grep -w "$1" spam.log | grep -w "$2" | grep -wc "$3")
fp=$(grep -w "$1" ham.log  | grep -w "$2" | grep -wc "$3")
spams=$(grep -c '^[^#]' spam.log)
hams=$(grep -c '^[^#]' ham.log)
tpr=$(echo "scale=5; $tp * 100 / $spams" | bc)
fpr=$(echo "scale=5; $fp * 100 / $hams" | bc)
# Guard against division by zero when the combination never fires.
if [ "$(echo "$tpr + $fpr > 0" | bc)" = 1 ]; then
    so=$(echo "scale=4; $tpr / ($tpr + $fpr)" | bc)
else
    so="n/a"
fi
echo "meta rule $1 && $2 && $3"
echo "  SPAM% $tpr  HAM% $fpr  S/O $so"

Now you can pick your thresholds for moving forward (and your
thresholds for saving a combination as a no-go in the future). These
numbers are just as valid as anything you'd get through the actual
mass-check run. Still, I worry about what this does to the GA.

PS: As an SA Committer, do I have access to those logs?

signature.asc
Description: OpenPGP digital signature
Re: New Bayes like paradigm
On 10/10/2011 9:16 AM, dar...@chaosreigns.com wrote:
> On 10/10, Marc Perkel wrote:
>> On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
>>> On 09/28, Marc Perkel wrote:
>>>> You would only have to test the rule combinations that the message
>>>> actually triggered. So if it hit 10 rules then it would be 1024
>>>> combinations. Seems not to be unreasonable to me.
>>> You definitely have a good point that it would only be necessary to
>>> track the combinations that actually show up in emails, however
>>> 1024 is only the possible combinations from one set of 10 rules.
>>> The number of combinations in the actual corpora would be much
>>> higher. I'll try to get you a number.
>> You wouldn't have to store all combinations. You could just do up to
>> 3 levels and only the combinations that actually occur and use a
>> hash to look up the combinations.
>
> I never said storage would be a problem. I agree you could just store
> a relatively small number that were most useful. The problems are:
>
> 1) The many years it would take to find useful rule combinations by
> trying one possibility per masscheck run.
>
> 2) The hundreds of times as much (masscheck) data we'd need to get an
> accurate re-score using all rule combinations existing in the
> corpora.
>
> There is still the possibility of doing an analysis of what
> combinations of rules hit false-negatives significantly more often
> than they hit non-spam. (Or false-positives vs. spam.)

I suppose it seems to me that there should be some automated way to
find useful rule combinations.

--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
Re: New Bayes like paradigm
On 10/10, Marc Perkel wrote:
> On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
>> On 09/28, Marc Perkel wrote:
>>> You would only have to test the rule combinations that the message
>>> actually triggered. So if it hit 10 rules then it would be 1024
>>> combinations. Seems not to be unreasonable to me.
>> You definitely have a good point that it would only be necessary to
>> track the combinations that actually show up in emails, however
>> 1024 is only the possible combinations from one set of 10 rules.
>> The number of combinations in the actual corpora would be much
>> higher. I'll try to get you a number.
>
> You wouldn't have to store all combinations. You could just do up to
> 3 levels and only the combinations that actually occur and use a
> hash to look up the combinations.

I never said storage would be a problem. I agree you could just store
a relatively small number that were most useful. The problems are:

1) The many years it would take to find useful rule combinations by
trying one possibility per masscheck run.

2) The hundreds of times as much (masscheck) data we'd need to get an
accurate re-score using all rule combinations existing in the corpora.

There is still the possibility of doing an analysis of what
combinations of rules hit false-negatives significantly more often
than they hit non-spam. (Or false-positives vs. spam.)

--
Immorality: "The morality of those who are having a better time"
 - Henry Louis Mencken
http://www.ChaosReigns.com
Re: New Bayes like paradigm
On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
> On 09/28, Marc Perkel wrote:
>> You would only have to test the rule combinations that the message
>> actually triggered. So if it hit 10 rules then it would be 1024
>> combinations. Seems not to be unreasonable to me.
> You definitely have a good point that it would only be necessary to
> track the combinations that actually show up in emails, however 1024
> is only the possible combinations from one set of 10 rules. The
> number of combinations in the actual corpora would be much higher.
> I'll try to get you a number.

You wouldn't have to store all combinations. You could just do up to 3
levels and only the combinations that actually occur and use a hash to
look up the combinations.

--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
Re: New Bayes like paradigm
On 09/28, dar...@chaosreigns.com wrote:
> On 09/28, Marc Perkel wrote:
>> You would only have to test the rule combinations that the message
>> actually triggered. So if it hit 10 rules then it would be 1024
>> combinations. Seems not to be unreasonable to me.
> combinations in the actual corpora would be much higher. I'll try to
> get you a number.

360,468. Combinations of rules seen in the actual mass-check corpora,
from the latest -net run (2011-09-24), after stripping out T_* and __*
rules, but not stripping out "tflags nopublish" rules.

So that would only take about 394 times as much data submitted via
mass-check as we currently have, to maintain a similar level of
accuracy :)

Seems likely I could find something useful in this direction though.
Looking for combinations of 2 or 3 rules that show up relatively often
in mis-categorized emails.

--
"Am I a man who dreamed I was a butterfly, or am I a butterfly who is
dreaming I am a man?" - Chuang Tsu, ~350 BC
http://www.ChaosReigns.com
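[Editor's sketch: counting those combinations from the shared logs
could look roughly like the following Python. This is not darxus's
actual script, and the log parsing is an assumption — it treats each
non-comment mass-check line as having a comma-separated rule list in
its fourth whitespace-separated field; adjust to your local layout.]

```python
import itertools
from collections import Counter

def combos_in_log(path, max_size=3):
    """Count rule combinations (pairs up to max_size-tuples) seen in
    a mass-check log.  Assumes each non-comment line carries a
    comma-separated rule list in its fourth whitespace-separated
    field; adjust the parsing to match your local log layout."""
    counts = Counter()
    with open(path) as log:
        for line in log:
            if line.startswith('#'):
                continue
            fields = line.split()
            if len(fields) < 4:
                continue
            # Strip T_* (test) and __* (subrule) hits, as above.
            rules = sorted(r for r in fields[3].split(',')
                           if r and not r.startswith(('T_', '__')))
            for size in range(2, max_size + 1):
                counts.update(itertools.combinations(rules, size))
    return counts
```

Run it over both spam.log and ham.log and diff the resulting counters
to find the combinations that skew heavily toward mis-categorized
mail.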
Re: New Bayes like paradigm
On 09/28, Marc Perkel wrote:
> You would only have to test the rule combinations that the message
> actually triggered. So if it hit 10 rules then it would be 1024
> combinations. Seems not to be unreasonable to me.

You definitely have a good point that it would only be necessary to
track the combinations that actually show up in emails, however 1024
is only the possible combinations from one set of 10 rules. The number
of combinations in the actual corpora would be much higher. I'll try
to get you a number.

--
"Will I ever learn? I hope not, I'm having too much fun."
 - Brent "Minime" Avis, motorcycle.com
http://www.ChaosReigns.com
Re: New Bayes like paradigm
On 9/27/2011 9:25 PM, dar...@chaosreigns.com wrote:
> On 09/27, Marc Perkel wrote:
>> Here's the kind of thing I'm seeing. Spam talks about money - low
>> score. Spam talks about Jesus - low score. Spam talks about money
>> and Jesus and throw in a dear someone and it's spam. I'm hoping to
>> detect combinations automatically.
>
> You're not really talking about something bayes does. But I've
> thought a little bit about doing something like it. People
> contributing mass-check data have access to everybody else's data
> (just the rule hit counts, not actual email contents), so I can do
> statistical analysis to find patterns like this.
>
> The problem, which we come across over and over again, is not enough
> data. We barely get enough mass-check data to provide useful scores
> with the existing method, where you're only analyzing the frequency
> of individual rules, basically. When you start analyzing frequencies
> of patterns, you need a lot more data.
>
> So yeah, you could write a score generator that, instead of coming
> up with:
>
> test A = 0.3
> test B = 0.1
> test C = 4
>
> comes up with optimal scores for all possible combinations:
>
> test A = 0
> test B = 0.1
> test C = 0.2
> test A+B = 6
> test A+C = 5.3
> test B+C = 2
> test A+B+C = -0.3
>
> (wouldn't that be fun?)
>
> But score generation requires a significant number of email samples
> with each test, and "A+B" ends up becoming an additional test, with
> far fewer samples. It causes... exponential problems with the input
> data required. I might even have tried it and have code laying
> around somewhere. If only I had the data of a large email provider,
> accurately sorted into spam and non-spam.
>
> Hell, once you're doing analysis of all the possible combinations of
> test hits, you hardly even have a use for scores, and can just
> reduce your results to "this combination is spam" and "this
> combination is not spam". Sexy.
>
> Ooh, I can make the problem clearer. Currently, score generation
> won't trigger unless the mass-check corpora contain 150,000 hams
> (non-spams) and 150,000 spams. So say we need 300,000 emails, hand
> sorted, to calculate scores. And the 50_scores.cf file contains 913
> rules. So, for rough estimation, say that works out to needing
> 300,000/913 = 328.6 emails per rule.
>
> Now how many combinations of rules can you come up with if you start
> with 913 rules? I don't remember how to calculate it, but I can tell
> you it's freaking huge. Then multiply it by 328.6 to get the number
> of emails we'd need to calculate accurate scores for each
> combination. Of course it would be likely to be useful to only track
> scores for combinations of up to, say, 10 rules, which would
> significantly reduce the problem, but it would still be nasty.
>
> Hmm, doesn't look as bad as I thought. With 913 rules, the number of
> combinations of 4 rules looks like 28761672940.
> 3 rules: 126424936
> 2 rules: 416328
> 1 rule: 913 (yay, this step at least isn't horribly wrong)
>
> So 28761672940+126424936+416328+913 = 28888515117 possible
> combinations of 1 to 4 rules; multiply by 328.6, and we need
> 9492766067446 emails, hand sorted into ham and spam, to come up with
> accurate scores for those combinations (of just 1-4 tests). And it
> looks like we're not currently even getting enough for score
> generation to work as is, and that's still 31 MILLION TIMES the
> minimum number of emails required by the current system. And that
> still doesn't address the problem of handling emails that hit more
> than 4 rules, although, in comparison, I think that one would be
> easy.
>
> Somebody please show me where I'm wrong on the number of emails
> required, and how we can actually make this happen. Because that
> would be fun.
>
> http://www.mathsisfun.com/combinatorics/combinations-permutations.html
> Combinations without Repetition
> http://ardoino.altervista.org/blog/index.php?id=48 - how to do
> factorials in bc.

You would only have to test the rule combinations that the message
actually triggered. So if it hit 10 rules then it would be 1024
combinations. Seems not to be unreasonable to me.

--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
Re: New Bayes like paradigm
Another possibility would be to generate meta rules from random sets
of three rules. Some (actually random) examples:

meta RANDOM_3_A = (MPART_ALT_DIFF && GAPPY_SUBJECT && URI_UNSUBSCRIBE)
meta RANDOM_3_B = (RCVD_IN_MAPS_OPS && WEIRD_PORT && FSL_FAKE_GMAIL_RCVD)
meta RANDOM_3_C = (FB_CAN_LONGER && FU_HOODIA && RCVD_IN_NJABL_PROXY)

And, one rule at a time, re-run score generation to see if it comes up
with a higher accuracy result. If it does, you'd need to re-run score
generation a few more times with and without the additional rule to
verify it's not just a fluke of the random selection of train vs. test
corpora. You could increase your chances by focusing on rules that
show up in emails that were incorrectly categorized (false positives /
negatives).

Info on running score generation:
http://wiki.apache.org/spamassassin/RescoreMassCheck

Sounds more reasonable than my last post, right? I don't remember how
long it takes, but say you can run it 10 times per day; with 913 rules
there are 126424936 possible combinations of 3 rules, so that would
take you about 34,614 years. But that's without focusing on the
combinations that show up in incorrectly classified emails. Maybe we
could get distributed.net in on it.

--
"It is the first responsibility of every citizen to question
authority." - Benjamin Franklin
http://www.ChaosReigns.com
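[Editor's sketch: a throwaway generator for such candidate metas. The
rule names are whatever you feed in; nothing here validates that they
exist in your ruleset.]

```python
import random

def random_meta_rules(rules, n=3, count=3, prefix="RANDOM_3_"):
    """Emit candidate SpamAssassin meta lines built from random
    n-rule conjunctions, in the spirit of the examples above."""
    lines = []
    for i in range(count):
        picked = random.sample(rules, n)      # n distinct rules
        name = prefix + chr(ord('A') + i)     # RANDOM_3_A, _B, ...
        lines.append("meta %s (%s)" % (name, " && ".join(picked)))
    return lines
```

Pipe the output into a candidate .cf file, rescore, and keep only the
combinations that survive repeated train/test splits.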
Re: New Bayes like paradigm
On 09/27, Marc Perkel wrote:
> Here's the kind of thing I'm seeing. Spam talks about money - low
> score. Spam talks about Jesus - low score. Spam talks about money
> and Jesus and throw in a dear someone and it's spam. I'm hoping to
> detect combinations automatically.

You're not really talking about something bayes does. But I've thought
a little bit about doing something like it. People contributing
mass-check data have access to everybody else's data (just the rule
hit counts, not actual email contents), so I can do statistical
analysis to find patterns like this.

The problem, which we come across over and over again, is not enough
data. We barely get enough mass-check data to provide useful scores
with the existing method, where you're only analyzing the frequency of
individual rules, basically. When you start analyzing frequencies of
patterns, you need a lot more data.

So yeah, you could write a score generator that, instead of coming up
with:

test A = 0.3
test B = 0.1
test C = 4

comes up with optimal scores for all possible combinations:

test A = 0
test B = 0.1
test C = 0.2
test A+B = 6
test A+C = 5.3
test B+C = 2
test A+B+C = -0.3

(wouldn't that be fun?)

But score generation requires a significant number of email samples
with each test, and "A+B" ends up becoming an additional test, with
far fewer samples. It causes... exponential problems with the input
data required. I might even have tried it and have code laying around
somewhere. If only I had the data of a large email provider,
accurately sorted into spam and non-spam.

Hell, once you're doing analysis of all the possible combinations of
test hits, you hardly even have a use for scores, and can just reduce
your results to "this combination is spam" and "this combination is
not spam". Sexy.

Ooh, I can make the problem clearer. Currently, score generation won't
trigger unless the mass-check corpora contain 150,000 hams (non-spams)
and 150,000 spams. So say we need 300,000 emails, hand sorted, to
calculate scores. And the 50_scores.cf file contains 913 rules. So,
for rough estimation, say that works out to needing 300,000/913 =
328.6 emails per rule.

Now how many combinations of rules can you come up with if you start
with 913 rules? I don't remember how to calculate it, but I can tell
you it's freaking huge. Then multiply it by 328.6 to get the number of
emails we'd need to calculate accurate scores for each combination. Of
course it would be likely to be useful to only track scores for
combinations of up to, say, 10 rules, which would significantly reduce
the problem, but it would still be nasty.

Hmm, doesn't look as bad as I thought. With 913 rules, the number of
combinations of 4 rules looks like 28761672940.
3 rules: 126424936
2 rules: 416328
1 rule: 913 (yay, this step at least isn't horribly wrong)

So 28761672940+126424936+416328+913 = 28888515117 possible
combinations of 1 to 4 rules; multiply by 328.6, and we need
9492766067446 emails, hand sorted into ham and spam, to come up with
accurate scores for those combinations (of just 1-4 tests). And it
looks like we're not currently even getting enough for score
generation to work as is, and that's still 31 MILLION TIMES the
minimum number of emails required by the current system. And that
still doesn't address the problem of handling emails that hit more
than 4 rules, although, in comparison, I think that one would be easy.

Somebody please show me where I'm wrong on the number of emails
required, and how we can actually make this happen. Because that would
be fun.

http://www.mathsisfun.com/combinatorics/combinations-permutations.html
Combinations without Repetition
http://ardoino.altervista.org/blog/index.php?id=48 - how to do
factorials in bc.

--
"But do you have any idea how many SuperBalls you could buy if you
actually applied yourself in the world? Probably eleven, but you
should still try."
 - http://hyperboleandahalf.blogspot.com/
http://www.ChaosReigns.com
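[Editor's note: the combination counts above are easy to double-check
with Python's math.comb. The sum of the four figures is
28,888,515,117, which times ~328.6 emails per test does give the ~9.49
trillion emails quoted.]

```python
from math import comb

# Combinations of 1 to 4 rules drawn from the 913 scored rules.
counts = {k: comb(913, k) for k in range(1, 5)}
total = sum(counts.values())

# ~328.6 hand-sorted emails per test, per the 300,000/913 estimate.
emails_needed = total * 300000 // 913
```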
Re: New Bayes like paradigm
On 9/25/2011 5:37 PM, RW wrote:
> On Sun, 25 Sep 2011 09:28:32 -0700 Marc Perkel wrote:
>> Here's what I'd like to be able to do. I'd like a program of some
>> sort where I could take word tokens - like names of rules that were
>> triggered - and look for rule combinations that indicate spam or
>> ham. For example, a message triggers 4 rules A B C and D. These
>> rules are combined as follows:
>>
>> A
>> ...
>> ABCD
>>
>> Each rule combo is then looked up for how often it occurs in spam
>> and how often it occurs in ham. Then the results are combined into
>> some sort of likelihood of being spam or ham.
>
> There are a couple of problems with this. The first is that most SA
> rules are either neutral or strong spam indicators, which makes them
> unsuitable for the sort of techniques used in Bayes.
>
> The second is that most of the scope for meaningful combinations is
> in high-scoring spam. Low-scoring spams are low-scoring because SA
> couldn't find much evidence - in these you're going to end up with
> meaningless strong+neutral combinations like BAYES_99+SPF_PASS.
>
> That's not to say that it can't be done in a more general sense; the
> scoring system is a way of converting rule combinations into a
> classification. Similar questions have been asked before; IIRC
> someone came up with an alternative way of getting a classification
> from the rule hits based on learning, and made a basic plugin that
> tweaked the score accordingly.

Here's the kind of thing I'm seeing. Spam talks about money - low
score. Spam talks about Jesus - low score. Spam talks about money and
Jesus and throw in a dear someone and it's spam. I'm hoping to detect
combinations automatically.

--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
Re: New Bayes like paradigm
On Sun, 25 Sep 2011 09:28:32 -0700 Marc Perkel wrote:
> Here's what I'd like to be able to do. I'd like a program of some
> sort where I could take word tokens - like names of rules that were
> triggered - and look for rule combinations that indicate spam or
> ham. For example, a message triggers 4 rules A B C and D. These
> rules are combined as follows:
>
> A
> ...
> ABCD
>
> Each rule combo is then looked up for how often it occurs in spam
> and how often it occurs in ham. Then the results are combined into
> some sort of likelihood of being spam or ham.

There are a couple of problems with this. The first is that most SA
rules are either neutral or strong spam indicators, which makes them
unsuitable for the sort of techniques used in Bayes.

The second is that most of the scope for meaningful combinations is in
high-scoring spam. Low-scoring spams are low-scoring because SA
couldn't find much evidence - in these you're going to end up with
meaningless strong+neutral combinations like BAYES_99+SPF_PASS.

That's not to say that it can't be done in a more general sense; the
scoring system is a way of converting rule combinations into a
classification. Similar questions have been asked before; IIRC someone
came up with an alternative way of getting a classification from the
rule hits based on learning, and made a basic plugin that tweaked the
score accordingly.
Re: New Bayes like paradigm
On Sun, 25 Sep 2011 09:28:32 -0700, Marc Perkel wrote:
> Hope you all understand what I'm saying here. How would someone do
> something like that?

meta foo ((__a + __b + __c + __d) > x)

where x is how many of the rules need to hit. Then make __a __b __c
__d body or header subrules for whatever you like to scan for. Or was
it more like:

meta foo (__a && __b)

Trial and error mode :=)
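[Editor's sketch: spelled out with hypothetical subrule patterns,
picking up the money/Jesus example from elsewhere in the thread. The
rule names, regexes, and score are all made up for illustration.]

```
body     __MONEY          /\bmoney\b/i
body     __JESUS          /\bjesus\b/i
body     __DEAR_SOMEONE   /\bdear\s+\w+/i
meta     MONEY_JESUS_DEAR ((__MONEY + __JESUS + __DEAR_SOMEONE) > 2)
describe MONEY_JESUS_DEAR Money + Jesus + "dear someone" combination
score    MONEY_JESUS_DEAR 2.5
```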
Re: New Bayes like paradigm
On Sun, 25 Sep 2011 09:28:32 -0700 Marc Perkel wrote:
> Each rule combo is then looked up for how often it occurs in spam and
> how often it occurs in ham. Then the results are combined into some
> sort of likelihood of being spam or ham.

We looked at (and even implemented) some "meta-tokens" that we throw
into Bayes. We found that trying to be too clever is self-defeating;
plain old Bayes is generally better at locking on to spam/ham trends
than feeding it derived data.

Anyway, in theory, SA's genetic weighting algorithm should give you
the desired results, though I suppose a Bayes-like approach might
tweak the rule weights for your specific mail stream.

Regards,
David.
New Bayes like paradigm
Here's what I'd like to be able to do. I'd like a program of some sort
where I could take word tokens - like names of rules that were
triggered - and look for rule combinations that indicate spam or ham.
For example, a message triggers 4 rules A B C and D. These rules are
combined as follows:

A B C D
AB AC AD BC BD CD
ABC ABD ACD BCD
ABCD

Each rule combo is then looked up for how often it occurs in spam and
how often it occurs in ham. Then the results are combined into some
sort of likelihood of being spam or ham.

Hope you all understand what I'm saying here. How would someone do
something like that?

--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
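[Editor's sketch of that lookup. The spam_counts/ham_counts tables are
hypothetical — dicts mapping frozensets of rule names to how often
each combination was seen — and the combining step is a crude centered
spam-fraction sum, not a proper Bayesian combination.]

```python
from itertools import combinations

def combo_likelihood(hit_rules, spam_counts, ham_counts):
    """Enumerate every non-empty combination of the rules a message
    hit, look up how often each was seen in spam vs. ham, and fold
    the per-combination spam fractions into one rough score
    (positive = spammy, negative = hammy)."""
    rules = sorted(hit_rules)
    score = 0.0
    for size in range(1, len(rules) + 1):
        for combo in combinations(rules, size):
            s = spam_counts.get(frozenset(combo), 0)
            h = ham_counts.get(frozenset(combo), 0)
            if s + h:
                # Fraction of this combo's sightings that were spam,
                # centered so 0 means "no evidence either way".
                score += s / (s + h) - 0.5
    return score
```

Note a message hitting n rules yields 2^n - 1 non-empty combinations,
so in practice you'd cap the combination size, as discussed later in
the thread.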