Re: New Bayes like paradigm

2011-10-14 Thread darxus
On 10/13, Adam Katz wrote:
> PS:  As an SA Committer, do I have access to those logs?

Don't think so, but you can just ask for a regular masscheck account if you
don't already have one, and with that account do:

rsync --exclude '*~' -vaz "rsync.spamassassin.org::corpus" ./

-- 
"I'd rather be happy than right any day."
- Slartibartfast, The Hitchhiker's Guide to the Galaxy
http://www.ChaosReigns.com


Re: New Bayes like paradigm

2011-10-13 Thread Adam Katz
> On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
>> You definitely have a good point that it would only be necessary to
>> track the combinations that actually show up in emails, however
>> 1024 is only the possible combinations from one set of 10 rules.
>> The number of combinations in the actual corpora would be much
>> higher.  I'll try to get you a number.

On 10/10/2011 06:55 AM, Marc Perkel wrote:
> You wouldn't have to store all combinations. You could just do up to
> 3 levels and only the combinations that actually occur and use a hash
> to look up the combinations.

The data is all there if you have access to the spam.log and ham.log
files created by mass-check (warning, this code was composed in email,
not vim, and it has not been run):

#
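#!/bin/sh
# Give three rules as arguments.  Assumes ham.log and spam.log in PWD
[ $# -eq 3 ] || { echo "usage: $0 RULE1 RULE2 RULE3" >&2; exit 1; }

# Messages hitting all three rules; '^[^#]' skips mass-check comment lines
tp=$(grep '^[^#]' spam.log | grep -w "$1" | grep -w "$2" | grep -wc "$3")
fp=$(grep '^[^#]'  ham.log | grep -w "$1" | grep -w "$2" | grep -wc "$3")

# Total messages in each corpus
spams=$(grep -c '^[^#]' spam.log)
hams=$(grep -c '^[^#]' ham.log)

# Hit rates as percentages of each corpus
tpr=$(echo "scale=5; $tp * 100 / $spams" | bc)
fpr=$(echo "scale=5; $fp * 100 / $hams" | bc)

# S/O is undefined when the combination never hits at all
if [ "$tp" -eq 0 ] && [ "$fp" -eq 0 ]; then
  so="n/a"
else
  so=$(echo "scale=4; $tpr / ($tpr + $fpr)" | bc)
fi

echo "meta rule  $1 && $2 && $3"
echo "  SPAM% $tpr   HAM% $fpr   S/O $so"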
#

Now you can pick your thresholds for moving forward (and your thresholds
for saving a combination as a no-go in the future).  These numbers are
just as valid as anything you'd get through the actual mass-check run.
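
For anyone who would rather not chain greps, the same three numbers can be sketched in Python. This is only an illustration and has not been run against real mass-check output. It assumes that each non-comment log line carries the comma-separated rule hits in its fourth whitespace-separated field, and the function names `parse_hits` and `combo_stats` are made up for this example.

```python
def parse_hits(line):
    """Return the set of rules hit on one log line, or None for comments/blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    # Assumption: field 4 is the comma-separated rule list; a key=value
    # field in that position means the message hit no rules.
    if len(fields) > 3 and "=" not in fields[3]:
        return set(fields[3].split(","))
    return set()


def combo_stats(rules, spam_lines, ham_lines):
    """Return (SPAM%, HAM%, S/O) for messages hitting every rule in `rules`."""
    wanted = set(rules)

    def rate(lines):
        hits = [h for h in map(parse_hits, lines) if h is not None]
        matched = sum(1 for h in hits if wanted <= h)
        return 100.0 * matched / len(hits) if hits else 0.0

    tpr, fpr = rate(spam_lines), rate(ham_lines)
    so = tpr / (tpr + fpr) if tpr + fpr else None  # undefined if it never hits
    return tpr, fpr, so
```

Open file objects work as the line sources, for example `combo_stats(["A", "B", "C"], open("spam.log"), open("ham.log"))`. Unlike the script, this takes any number of rules rather than exactly three.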

Still, I worry about what this does to the GA.


PS:  As an SA Committer, do I have access to those logs?





Re: New Bayes like paradigm

2011-10-13 Thread Marc Perkel



On 10/10/2011 9:16 AM, dar...@chaosreigns.com wrote:
> On 10/10, Marc Perkel wrote:
>> On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
>>> On 09/28, Marc Perkel wrote:
>>>> You would only have to test the rule combinations that the message
>>>> actually triggered. So if it hit 10 rules then it would be 1024
>>>> combinations. Seems not to be unreasonable to me.
>>>
>>> You definitely have a good point that it would only be necessary to track
>>> the combinations that actually show up in emails, however 1024 is only
>>> the possible combinations from one set of 10 rules.  The number of
>>> combinations in the actual corpora would be much higher.  I'll try to
>>> get you a number.
>>
>> You wouldn't have to store all combinations. You could just do up to
>> 3 levels and only the combinations that actually occur and use a
>> hash to look up the combinations.
>
> I never said storage would be a problem.  I agree you could just store a
> relatively small number that were most useful.
>
> The problems are:
> 1) The many years it would take to find useful rule combinations by trying
>    one possibility per masscheck run.
> 2) The hundreds of times as much (masscheck) data we'd need to get an
>    accurate re-score using all rule combinations existing in the corpora.
>
> There is still the possibility of doing an analysis of what combinations of
> rules hit false-negatives significantly more often than they hit non-spam.
> (Or false-positives vs. spam.)


I suppose it seems to me that there should be some automated way to find 
useful rule combinations.



--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



Re: New Bayes like paradigm

2011-10-10 Thread darxus
On 10/10, Marc Perkel wrote:
> On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
> >On 09/28, Marc Perkel wrote:
> >>You would only have to test the rule combinations that the message
> >>actually triggered. So if it hit 10 rules then it would be 1024
> >>combinations. Seems not to be unreasonable to me.
> >You definitely have a good point that it would only be necessary to track
> >the combinations that actually show up in emails, however 1024 is only
> >the possible combinations from one set of 10 rules.  The number of
> >combinations in the actual corpora would be much higher.  I'll try to
> >get you a number.
> 
> You wouldn't have to store all combinations. You could just do up to
> 3 levels and only the combinations that actually occur and use a
> hash to look up the combinations.

I never said storage would be a problem.  I agree you could just store a
relatively small number that were most useful.

The problems are:
1) The many years it would take to find useful rule combinations by trying
   one possibility per masscheck run.
2) The hundreds of times as much (masscheck) data we'd need to get an
   accurate re-score using all rule combinations existing in the corpora.

There is still the possibility of doing an analysis of what combinations of
rules hit false-negatives significantly more often than they hit non-spam.
(Or false-positives vs. spam.)
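
A rough sketch of that kind of analysis, in Python. The inputs are assumed to be lists of rule-hit sets, one per message, already split into false negatives and ham. The names `combo_counts` and `fn_heavy_combos` and both thresholds are invented for illustration, and a plain rate ratio stands in for a real significance test.

```python
from collections import Counter
from itertools import combinations


def combo_counts(messages, max_size=3):
    """Count how many messages contain each combination of 2..max_size rules."""
    counts = Counter()
    for hits in messages:
        rules = sorted(set(hits))
        for size in range(2, max_size + 1):
            counts.update(combinations(rules, size))
    return counts


def fn_heavy_combos(false_negatives, hams, min_fn=2, min_ratio=5.0):
    """Return combinations far more frequent in missed spam than in ham."""
    fn_counts = combo_counts(false_negatives)
    ham_counts = combo_counts(hams)
    found = []
    for combo, n_fn in fn_counts.items():
        if n_fn < min_fn:
            continue  # too few samples to say anything
        fn_rate = n_fn / len(false_negatives)
        # Add-one smoothing so combinations never seen in ham do not divide by zero.
        ham_rate = (ham_counts[combo] + 1) / (len(hams) + 1)
        if fn_rate / ham_rate >= min_ratio:
            found.append((combo, n_fn, ham_counts[combo]))
    return sorted(found, key=lambda item: -item[1])
```

Swapping the arguments (false positives first, spam second) gives the "false-positives vs. spam" variant.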

-- 
Immorality: "The morality of those who are having a better time"
- Henry Louis Mencken
http://www.ChaosReigns.com


Re: New Bayes like paradigm

2011-10-10 Thread Marc Perkel



On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
> On 09/28, Marc Perkel wrote:
>> You would only have to test the rule combinations that the message
>> actually triggered. So if it hit 10 rules then it would be 1024
>> combinations. Seems not to be unreasonable to me.
>
> You definitely have a good point that it would only be necessary to track
> the combinations that actually show up in emails, however 1024 is only
> the possible combinations from one set of 10 rules.  The number of
> combinations in the actual corpora would be much higher.  I'll try to
> get you a number.


You wouldn't have to store all combinations. You could just do up to 3 
levels and only the combinations that actually occur and use a hash to 
look up the combinations.
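
One way to picture the "up to 3 levels, in a hash" idea is the Python sketch below. It is an illustration under assumptions, not anything SpamAssassin does: the table layout (a dict keyed by sorted rule-name tuples, holding spam and ham counts) and the names `record_message` and `lookup` are invented here.

```python
from itertools import combinations


def record_message(table, hits, is_spam, max_level=3):
    """Add one message's rule combinations (sizes 1..max_level) to the hash."""
    rules = sorted(set(hits))
    for level in range(1, max_level + 1):
        for combo in combinations(rules, level):
            counts = table.setdefault(combo, [0, 0])  # [spam count, ham count]
            counts[0 if is_spam else 1] += 1


def lookup(table, hits, max_level=3):
    """Return {combo: [spam, ham]} for stored combinations present in a message."""
    rules = sorted(set(hits))
    found = {}
    for level in range(1, max_level + 1):
        for combo in combinations(rules, level):
            if combo in table:
                found[combo] = table[combo]
    return found
```

Capping at three levels means a message that hits 10 rules contributes 10 + 45 + 120 = 175 entries rather than every one of its 1023 non-empty combinations.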





Re: New Bayes like paradigm

2011-09-28 Thread darxus
On 09/28, dar...@chaosreigns.com wrote:
> On 09/28, Marc Perkel wrote:
> > You would only have to test the rule combinations that the message
> > actually triggered. So if it hit 10 rules then it would be 1024
> > combinations. Seems not to be unreasonable to me.

> combinations in the actual corpora would be much higher.  I'll try to
> get you a number.

360,468.  Combinations of rules seen in the actual mass-check corpora, from
the latest -net run (2011-09-24), after stripping out T_* and __* rules,
but not stripping out "tflags nopublish" rules.  So that would only take
about 394 times as much data submitted via mass-check as we currently have,
to maintain a similar level of accuracy :)
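
A count like this could be produced along the following lines. This Python sketch assumes the rule hits for each message are already available as lists of names. The function names are hypothetical, the "tflags nopublish" distinction is ignored, and reading the 394 figure as 360,468 divided by the 913 scored rules mentioned elsewhere in the thread is a guess at the arithmetic.

```python
def distinct_combinations(hit_lists):
    """Count distinct sets of rules seen together, ignoring T_* and __* rules."""
    seen = set()
    for hits in hit_lists:
        kept = frozenset(
            r for r in hits if not r.startswith("T_") and not r.startswith("__")
        )
        if kept:  # a message left with no rules is not a combination
            seen.add(kept)
    return len(seen)


def data_multiplier(n_combinations, n_rules):
    """How many times more 'tests' there are if every combination is scored."""
    return n_combinations / n_rules
```

With the numbers from this thread, `data_multiplier(360468, 913)` comes out just under 395.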

Seems likely I could find something useful in this direction though.
Looking for combinations of 2 or 3 rules that show up relatively often in
mis-categorized emails.

-- 
"Am I a man who dreamed I was a butterfly, or am I a butterfly who is
dreaming I am a man?" - Chuang Tsu, ~350 BC
http://www.ChaosReigns.com


Re: New Bayes like paradigm

2011-09-28 Thread darxus
On 09/28, Marc Perkel wrote:
> You would only have to test the rule combinations that the message
> actually triggered. So if it hit 10 rules then it would be 1024
> combinations. Seems not to be unreasonable to me.

You definitely have a good point that it would only be necessary to track
the combinations that actually show up in emails, however 1024 is only
the possible combinations from one set of 10 rules.  The number of
combinations in the actual corpora would be much higher.  I'll try to
get you a number.

-- 
"Will I ever learn? I hope not, I'm having too much fun."
- Brent "Minime" Avis, motorcycle.com
http://www.ChaosReigns.com


Re: New Bayes like paradigm

2011-09-28 Thread Marc Perkel



On 9/27/2011 9:25 PM, dar...@chaosreigns.com wrote:
> On 09/27, Marc Perkel wrote:
>> Here's the kind of thing I'm seeing. Spam talks about money - low
>> score. Spam talks about Jesus - low score. Spam talks about money
>> and Jesus and throw in a dear someone and it's spam. I'm hoping to
>> detect combinations automatically.
>
> You're not really talking about something bayes does.
>
> But I've thought a little bit about doing something like it.  People
> contributing mass-check data have access to everybody else's data (just
> the rule hit counts, not actual email contents), so I can do statistical
> analysis to find patterns like this.  The problem, which we come across
> over and over again, is not enough data.  We barely get enough mass-check
> data to provide useful scores with the existing method, where you're
> only analyzing the frequency of individual rules, basically.  When
> you start analyzing frequencies of patterns, you need a lot more data.
>
> So yeah, you could write a score generator that, instead of coming up with:
>
> test A = 0.3
> test B = 0.1
> test C = 4
>
> Comes up with optimal scores for all possible combinations:
>
> test A = 0
> test B = 0.1
> test C = 0.2
> test A+B = 6
> test A+C = 5.3
> test B+C = 2
> test A+B+C = -0.3  (wouldn't that be fun?)
>
> But score generation requires a significant number of email samples with
> each test, and, "A+B" ends up becoming an additional test, with far fewer
> samples.  It causes... exponential problems with the input data required.
> I might even have tried it and have code lying around somewhere.  If only
> I had the data of a large email provider, accurately sorted into spam and
> non-spam.
>
> Hell, once you're doing analysis of all the possible combinations of test
> hits, you hardly even have a use for scores, and can just reduce your
> results to "this combination is spam" and "this combination is not spam".
> Sexy.
>
> Ooh, I can make the problem clearer.  Currently, score generation won't
> trigger unless the mass-check corpora contains 150,000 hams (non-spams) and
> 150,000 spams.  So say we need 300,000 emails, hand sorted, to calculate
> scores.  And the 50_scores.cf file contains 913 rules.  So, for rough
> estimation, say that works out to needing 300,000/913 = 328.6 emails per
> rule.  Now how many combinations of rules can you come up with if you start
> with 913 rules?  I don't remember how to calculate it, but I can tell you
> it's freaking huge.  Then multiply it by 328.6 to get the number of emails
> we'd need to calculate accurate scores for each combination.  Of course it
> would be likely to be useful to only track scores for combinations of up
> to, say, 10 rules, which would significantly reduce the problem, but it
> would still be nasty.
>
> Hmm, doesn't look as bad as I thought.  With 913 rules, the number of
> combinations of 4 rules looks like 28761672940.
> 3 rules: 126424936
> 2 rules: 416328
> 1 rule: 913 (yay, this step at least isn't horribly wrong)
>
> So 28761672940+126424936+416328+913 = 28888515117 possible combinations of
> 1 to 4 rules, multiply by 328.6, and we need 9492766067446 emails, hand
> sorted into ham and spam, to come up with accurate scores for those
> combinations (of just 1-4 tests).  And it looks like we're not currently
> even getting enough for score generation to work as is, and that's
> still 31 MILLION TIMES the minimum number of emails required by the
> current system.
>
> And that still doesn't address the problem of handling emails that hit more
> than 4 rules, although, in comparison, I think that one would be easy.
>
>
> Somebody please show me where I'm wrong on the number of emails required,
> and how we can actually make this happen.  Because that would be fun.
>
>
> http://www.mathsisfun.com/combinatorics/combinations-permutations.html
> Combinations without Repetition
> http://ardoino.altervista.org/blog/index.php?id=48  - how to do factorials
> in bc.


You would only have to test the rule combinations that the message 
actually triggered. So if it hit 10 rules then it would be 1024 
combinations. Seems not to be unreasonable to me.





Re: New Bayes like paradigm

2011-09-27 Thread darxus
Another possibility would be to generate meta rules from random sets of
three rules.  Some (actually random) examples:

meta RANDOM_3_A (MPART_ALT_DIFF && GAPPY_SUBJECT && URI_UNSUBSCRIBE)
meta RANDOM_3_B (RCVD_IN_MAPS_OPS && WEIRD_PORT && FSL_FAKE_GMAIL_RCVD)
meta RANDOM_3_C (FB_CAN_LONGER && FU_HOODIA && RCVD_IN_NJABL_PROXY)

And, one rule at a time, re-run score generation to see if it comes up with
a higher accuracy result.  If it does, you'd need to re-run score
generation a few more times with and without the additional rule to verify
it's not just a fluke of the random selection of train vs. test corpora.
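
Generating the candidates is the easy part. A minimal Python sketch is below, assuming you already have a list of rule names. `random_meta_rules` and the numbered `RANDOM_<size>_<n>` naming are made up for this example, and the output follows SpamAssassin's `meta NAME expression` form. Re-running score generation for each candidate is not shown.

```python
import random


def random_meta_rules(rule_names, count, size=3, seed=None):
    """Return `count` meta rule lines, each ANDing `size` randomly chosen rules."""
    rng = random.Random(seed)  # pass a seed to make a run reproducible
    pool = sorted(rule_names)
    lines = []
    for i in range(count):
        picked = rng.sample(pool, size)  # no rule repeated within one meta
        lines.append("meta RANDOM_%d_%d (%s)" % (size, i + 1, " && ".join(picked)))
    return lines
```

Nothing here stops the same set of three being drawn twice, so a real run would want to remember which sets were already tried.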

You could increase your chances by focusing on rules that show up in emails
that were incorrectly categorized (false positives / negatives).  

Info on running score generation:
http://wiki.apache.org/spamassassin/RescoreMassCheck


Sounds more reasonable than my last post, right?

I don't remember how long it takes, but say you can run it 10 times per
day, and with 913 rules there are 126424936 possible combinations of 3
rules, so that would take you about 34,614 years.
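
The estimate is easy to re-derive. A small Python sketch follows, where 10 runs per day is the guess made above rather than a measured rate, and the function name is invented for this example:

```python
from math import comb


def years_to_try_all(n_rules, combo_size, runs_per_day):
    """Years needed to test every combination at one combination per run."""
    total = comb(n_rules, combo_size)
    return total / runs_per_day / 365.25


years = years_to_try_all(913, 3, 10)
print(f"{comb(913, 3)} combinations, about {years:,.0f} years")
```

The result scales linearly with the run rate, so even a thousandfold speedup still leaves decades of work unless the candidate set is narrowed first.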

But that's without focusing on the combinations that show up in incorrectly
classified emails.  Maybe we could get distributed.net in on it.

-- 
"It is the first responsibility of every citizen to question authority."
- Benjamin Franklin
http://www.ChaosReigns.com


Re: New Bayes like paradigm

2011-09-27 Thread darxus
On 09/27, Marc Perkel wrote:
> Here's the kind of thing I'm seeing. Spam talks about money - low
> score. Spam talks about Jesus - low score. Spam talks about money
> and Jesus and throw in a dear someone and it's spam. I'm hoping to
> detect combinations automatically.

You're not really talking about something bayes does.

But I've thought a little bit about doing something like it.  People
contributing mass-check data have access to everybody else's data (just
the rule hit counts, not actual email contents), so I can do statistical
analysis to find patterns like this.  The problem, which we come across
over and over again, is not enough data.  We barely get enough mass-check
data to provide useful scores with the existing method, where you're
only analyzing the frequency of individual rules, basically.  When
you start analyzing frequencies of patterns, you need a lot more data.  

So yeah, you could write a score generator that, instead of coming up with:

test A = 0.3
test B = 0.1
test C = 4

Comes up with optimal scores for all possible combinations:

test A = 0
test B = 0.1
test C = 0.2
test A+B = 6
test A+C = 5.3
test B+C = 2
test A+B+C = -0.3  (wouldn't that be fun?)

But score generation requires a significant number of email samples with
each test, and, "A+B" ends up becoming an additional test, with far fewer
samples.  It causes... exponential problems with the input data required.
I might even have tried it and have code lying around somewhere.  If only
I had the data of a large email provider, accurately sorted into spam and
non-spam.
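
To make the combination-score idea concrete, here is a small Python sketch that scores a message from a table like the hypothetical one above. It assumes each combination behaves as one more additive test, which is how "A+B becomes an additional test" reads but is not spelled out. The names `COMBO_SCORES` and `message_score` are invented for illustration.

```python
# Scores copied from the made-up example above; keys are sorted rule tuples.
COMBO_SCORES = {
    ("A",): 0.0,
    ("B",): 0.1,
    ("C",): 0.2,
    ("A", "B"): 6.0,
    ("A", "C"): 5.3,
    ("B", "C"): 2.0,
    ("A", "B", "C"): -0.3,
}


def message_score(hits, scores=COMBO_SCORES):
    """Sum the score of every scored combination fully contained in `hits`."""
    hit_set = set(hits)
    return sum(score for combo, score in scores.items() if hit_set.issuperset(combo))
```

A message hitting only A and B picks up the A, B and A+B entries. One hitting all three picks up all seven, including the negative A+B+C term.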

Hell, once you're doing analysis of all the possible combinations of test
hits, you hardly even have a use for scores, and can just reduce your
results to "this combination is spam" and "this combination is not spam".
Sexy.

Ooh, I can make the problem clearer.  Currently, score generation won't
trigger unless the mass-check corpora contains 150,000 hams (non-spams) and
150,000 spams.  So say we need 300,000 emails, hand sorted, to calculate
scores.  And the 50_scores.cf file contains 913 rules.  So, for rough
estimation, say that works out to needing 300,000/913 = 328.6 emails per
rule.  Now how many combinations of rules can you come up with if you start
with 913 rules?  I don't remember how to calculate it, but I can tell you
it's freaking huge.  Then multiply it by 328.6 to get the number of emails
we'd need to calculate accurate scores for each combination.  Of course it
would be likely to be useful to only track scores for combinations of up
to, say, 10 rules, which would significantly reduce the problem, but it
would still be nasty.  
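
For the record, the number of non-empty combinations of n rules is 2^n - 1, and the number of combinations of exactly k rules is "n choose k". A short Python sketch of the figures in this paragraph follows. The variable names are arbitrary, and the 150,000 + 150,000 minimum and the 913 rule count are taken from the text above.

```python
from math import comb

RULES = 913
MIN_CORPUS = 150_000 + 150_000        # hams + spams needed before scoring runs

emails_per_rule = MIN_CORPUS / RULES  # rough estimate used in the text
all_combinations = 2 ** RULES - 1     # every non-empty subset of the rules
up_to_ten = sum(comb(RULES, k) for k in range(1, 11))

print(f"emails per rule:         {emails_per_rule:.1f}")
print(f"all combinations:        a {len(str(all_combinations))}-digit number")
print(f"combinations of 1 to 10: a {len(str(up_to_ten))}-digit number")
```

Capping the combination size shrinks the count enormously compared with all subsets, which is the "significantly reduce the problem" part, though it stays far beyond anything a corpus could cover.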

Hmm, doesn't look as bad as I thought.  With 913 rules, the number of
combinations of 4 rules looks like 28761672940.  
3 rules: 126424936
2 rules: 416328
1 rule: 913 (yay, this step at least isn't horribly wrong)

So 28761672940+126424936+416328+913 = 28888515117 possible combinations of
1 to 4 rules, multiply by 328.6, and we need 9492766067446 emails, hand
sorted into ham and spam, to come up with accurate scores for those
combinations (of just 1-4 tests).  And it looks like we're not currently
even getting enough for score generation to work as is, and that's
still 31 MILLION TIMES the minimum number of emails required by the
current system.
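
The arithmetic can be checked with a few lines of Python. This is just a calculator sketch: the 328.6 emails-per-rule figure and the 300,000 minimum come from the text above, and the variable names are arbitrary.

```python
from math import comb

RULES = 913
EMAILS_PER_RULE = 328.6      # 300,000 / 913, rounded as in the text
CURRENT_MINIMUM = 300_000    # emails the current score generation requires

by_size = {k: comb(RULES, k) for k in range(1, 5)}
total_combinations = sum(by_size.values())
emails_needed = total_combinations * EMAILS_PER_RULE
ratio = emails_needed / CURRENT_MINIMUM

for k, n in sorted(by_size.items()):
    print(f"{k} rule(s): {n}")
print(f"total: {total_combinations}")
print(f"emails needed: {emails_needed:.0f} ({ratio / 1e6:.1f} million times the minimum)")
```

Almost all of the total comes from the 4-rule term, so stopping at 3 rules cuts the requirement by more than two orders of magnitude.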

And that still doesn't address the problem of handling emails that hit more
than 4 rules, although, in comparison, I think that one would be easy.


Somebody please show me where I'm wrong on the number of emails required,
and how we can actually make this happen.  Because that would be fun.


http://www.mathsisfun.com/combinatorics/combinations-permutations.html
Combinations without Repetition
http://ardoino.altervista.org/blog/index.php?id=48  - how to do factorials
in bc.

-- 
"But do you have any idea how many SuperBalls you could buy if you
actually applied yourself in the world? Probably eleven, but you should
still try." - http://hyperboleandahalf.blogspot.com/
http://www.ChaosReigns.com


Re: New Bayes like paradigm

2011-09-27 Thread Marc Perkel



On 9/25/2011 5:37 PM, RW wrote:
> On Sun, 25 Sep 2011 09:28:32 -0700
> Marc Perkel wrote:
>
>> Here's what I'd like to be able to do. I'd like a program of some
>> sort where I could take word tokens - like the names of rules that were
>> triggered - and look for rule combinations that indicate spam or ham.
>> For example, a message triggers 4 rules A B C and D. These rules are
>> combined as follows:
>>
>> A
>> ...
>> ABCD
>>
>> Each rule combo is then looked up for how often it occurs in spam and
>> how often it occurs in ham. Then the results are combined into some
>> sort of likelihood of being spam or ham.
>
> There are a couple of problems with this. The first is that most SA
> rules are either neutral or strong spam indicators, which make them
> unsuitable for the sort of techniques used in Bayes.
>
> The second is that most of the scope for meaningful combinations is in
> high-scoring spam. Low-scoring spams are low-scoring because SA couldn't
> find much evidence - in these you're going to end up with
> meaningless strong+neutral combinations like BAYES_99+SPF_PASS.
>
> That's not to say that it can't be done in a more general sense; the
> scoring system is a way of converting rule combinations into a
> classification.
>
> Similar questions have been asked before; IIRC someone came up with
> an alternative way of getting a classification from the rule hits
> based on learning, and made a basic plugin that tweaked the score
> accordingly.




Here's the kind of thing I'm seeing. Spam talks about money - low score.
Spam talks about Jesus - low score. Spam talks about money and Jesus and
throw in a dear someone and it's spam. I'm hoping to detect combinations
automatically.





Re: New Bayes like paradigm

2011-09-25 Thread RW
On Sun, 25 Sep 2011 09:28:32 -0700
Marc Perkel wrote:

> Here's what I'd like to be able to do. I'd like a program of some
> sort where I could take word tokens - like the names of rules that were
> triggered - and look for rule combinations that indicate spam or ham.
> For example, a message triggers 4 rules A B C and D. These rules are
> combined as follows:
> 
> A
> ...
> ABCD
> 
> Each rule combo is then looked up for how often it occurs in spam and 
> how often it occurs in ham. Then the results are combined into some
> sort of likelihood of being spam or ham.
> 

There are a couple of problems with this. The first is that most SA
rules are either neutral or strong spam indicators, which make them
unsuitable for the sort of techniques used in Bayes. 

The second is that most of the scope for meaningful combinations is in
high-scoring spam. Low-scoring spams are low-scoring because SA couldn't
find much evidence - in these you're going to end up with
meaningless strong+neutral combinations like BAYES_99+SPF_PASS.   

That's not to say that it can't be done in a more general sense; the
scoring system is a way  of converting rule combinations into a
classification.

Similar questions have been asked before; IIRC someone came up with
an alternative way of getting a classification from the rule hits
based on learning, and made a basic plugin that tweaked the score
accordingly.


Re: New Bayes like paradigm

2011-09-25 Thread Benny Pedersen

On Sun, 25 Sep 2011 09:28:32 -0700, Marc Perkel wrote:
> Hope you all understand what I'm saying here. How would someone do
> something like that?

meta foo ((__a + __b + __c + __d) > x)

foo hits when more than x of the four sub-rules hit.

Then define __a, __b, __c and __d as body, header, or whatever rules you
want to scan for.

Or did you mean something more like:

meta foo (__a && __b)

Trial and error mode :=)




Re: New Bayes like paradigm

2011-09-25 Thread David F. Skoll
On Sun, 25 Sep 2011 09:28:32 -0700
Marc Perkel  wrote:

> Each rule combo is then looked up for how often it occurs in spam and 
> how often it occurs in ham. Then the results are combined into some
> sort of likelihood of being spam or ham.

We looked at (and even implemented) some "meta-tokens" that we throw
into Bayes.  We found that trying to be too clever is self-defeating; plain
old Bayes is generally better at locking on to spam/ham trends than feeding
it derived data.

Anyway, in theory, SA's genetic weighting algorithm should give you the
desired results, though I suppose a Bayes-like approach might tweak
the rule weights for your specific mail stream.

Regards,

David.


New Bayes like paradigm

2011-09-25 Thread Marc Perkel
Here's what I'd like to be able to do. I'd like a program of some sort 
where I could take word tokens - like the names of rules that were triggered - 
and look for rule combinations that indicate spam or ham. For example, a 
message triggers 4 rules A B C and D. These rules are combined as follows:


A
B
C
D
AB
AC
AD
BC
BD
CD
ABC
ABD
ACD
BCD
ABCD

Each rule combo is then looked up for how often it occurs in spam and 
how often it occurs in ham. Then the results are combined into some sort 
of likelihood of being spam or ham.
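
As a rough illustration of the idea, the Python sketch below enumerates the combinations in the same order as the list above and then combines per-combination counts. Plain averaging of spam fractions is a deliberately naive stand-in for "some sort of likelihood". The function names and the `counts` table layout are invented for this example.

```python
from itertools import combinations


def rule_combinations(hits):
    """Yield every non-empty combination of the rules a message triggered."""
    rules = sorted(set(hits))
    for size in range(1, len(rules) + 1):
        for combo in combinations(rules, size):
            yield combo


def spam_likelihood(hits, counts):
    """Average the spam fraction of each previously seen combination.

    `counts` maps a combination tuple to (spam_count, ham_count).
    Returns None when none of the message's combinations has been seen.
    """
    fractions = []
    for combo in rule_combinations(hits):
        spam, ham = counts.get(combo, (0, 0))
        if spam + ham:
            fractions.append(spam / (spam + ham))
    return sum(fractions) / len(fractions) if fractions else None
```

Four rules give 15 combinations, and in general n rules give 2^n - 1, so a message hitting 10 rules produces 1023 lookups.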


Hope you all understand what I'm saying here. How would someone do 
something like that?

