Hi!

I suggested this once before and did not see any response.
Many rules suggested on this list share one characteristic: they are
a good test against e-mail that contains a large number of
occurrences (a high 'count') of a particular 'trick' or 'obfuscation'.
BUT these rules have to be scored very LOW because legitimate mail
sometimes contains one or two occurrences of the same text/string.

For example, someone might include a legitimate acronym, such as
I.B.M. or I.B.E.W., and this would trigger a rule that checks for a
single occurrence of 'period obfuscated text'. But if we were able to
check the COUNT of how many times a particular rule matched, we could
easily distinguish runaway use of obfuscation.

Now, if the current rule-checking logic has been optimized to stop after
it finds a successful match, then we would need an extra parameter to
tell the test to keep going and count all occurrences. Then we would need
a parameter on the 'score' line to work with those counts.
Here is an example of the proposed syntax, based on Jennifer's period
checker:

body LOC_PERIODS      count /\s[a-zA-Z]{9}\.[a-zA-Z]{1}[ ,'\?!]/i
describe LOC_PERIODS  Too many words with period spacing
score LOC_PERIODS     5:0.5,10:1.2 

Meaning in this case, score 0.5 for a count of 5 through 9, and 1.2 for
a count of 10 or higher. As per other scoring lines, you could have
up to four space-separated groups of scores.

Note that we do not want to use a straight *multiplier* as there will be
cases where we want to have no score until a certain minimum threshold is
reached. In the above example, up to 4 instances of period spaced words
would score nothing at all....
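
To make the threshold idea concrete, here is a rough sketch in Python
(not SpamAssassin code) of how a 'count:score' list could be evaluated.
The function names are made up for illustration, and it assumes the
highest threshold reached wins rather than the scores adding up:

```python
def parse_count_scores(spec):
    """Parse a hypothetical 'count:score' list such as '5:0.5,10:1.2'."""
    pairs = []
    for item in spec.split(","):
        threshold, score = item.split(":")
        pairs.append((int(threshold), float(score)))
    # Sort by threshold so the lookup below can walk upward.
    return sorted(pairs)


def score_for_count(pairs, count):
    """Return the score of the highest threshold reached, or 0.0 if none."""
    result = 0.0
    for threshold, score in pairs:
        if count >= threshold:
            result = score  # assumption: a higher threshold replaces a lower one
    return result
```

With the LOC_PERIODS score line above, a count of 4 scores nothing, 5
through 9 scores 0.5, and 10 or more scores 1.2.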

In terms of program logic, the main changes would be:
   - recognizing the 'count' parameter on the rule and accumulating the
     count, as well as ensuring that testing doesn't stop on the first
     match.
   - on the scoring, recognizing the 'x:y' pairs as being count related.
   - a simple error condition check for:
      - count-style scoring (x:y) for a rule that didn't use the 'count'
        option.
      - normal style scoring (x) for a rule that used the 'count' option.
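
The counting and error-check steps might look roughly like this. Again
this is only a Python sketch with invented names, not how SpamAssassin
is actually structured; the pattern is copied from the LOC_PERIODS
example:

```python
import re

# Pattern taken from the LOC_PERIODS example; /i becomes re.IGNORECASE.
LOC_PERIODS = re.compile(r"\s[a-zA-Z]{9}\.[a-zA-Z]{1}[ ,'\?!]", re.IGNORECASE)


def count_matches(pattern, body):
    """Count every non-overlapping match instead of stopping at the first."""
    return sum(1 for _ in pattern.finditer(body))


def check_rule(uses_count, score_spec):
    """Raise ValueError when the rule style and the score style disagree."""
    count_style = ":" in score_spec  # assumption: 'x:y' marks count scoring
    if count_style and not uses_count:
        raise ValueError("count-style score on a rule without 'count'")
    if uses_count and not count_style:
        raise ValueError("normal score on a rule that uses 'count'")
```

One thing to keep in mind: because matches do not overlap, two hits
that share the same separating space are counted only once.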

So, how's that grab people?  This would be a fundamental change, affecting
the basic behaviour of every test except for the 'evals' - and even then
with clever coding it might be applied to those. But I don't think it
would be a lot of code. It would probably take longer to document the new
usage.... :-)

- Charles


