http://bugzilla.spamassassin.org/show_bug.cgi?id=3023

           Summary: Detecting random garbage in emails
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Rules
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


I started receiving lots of messages with random "words" tacked onto
the spam message.  I know this has been discussed before, but it
seems that the discussions centered around using spell checkers to detect
these gibberish words.  Spell checkers have their share of problems...

What if, instead of a spell checker, we used the good old Bayes db?
From the scan() function in Bayes.pm:

    my %pw = map {
       ... map token into %pw hash ...
    } @tokens;

The number of keys of %pw is the number of tokens seen before (forget about the
score; all that's important is that the token was observed before).  If the
ratio between the number of %pw keys and the total number of tokens is too
small, it means that the message contains a lot of previously unseen tokens.
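The ratio described above can be sketched as follows.  This is only an
illustration in Python, not code from Bayes.pm: the function name
seen_token_ratio and the known_tokens set (standing in for a lookup against
the Bayes db) are hypothetical, and it assumes tokens are counted once each,
the way hash keys would be.

```python
def seen_token_ratio(tokens, known_tokens):
    """Return the fraction of distinct message tokens already in the database.

    tokens       -- iterable of token strings from one message
    known_tokens -- set of tokens seen before (hypothetical stand-in for
                    a Bayes db lookup)
    Returns None when the message has no tokens, since the ratio is undefined.
    """
    distinct = set(tokens)  # assumption: count each token once
    if not distinct:
        return None
    seen = sum(1 for token in distinct if token in known_tokens)
    return seen / len(distinct)


if __name__ == "__main__":
    known = {"hello", "world", "meeting"}
    print(seen_token_ratio(["hello", "world", "xqzvb", "plorf"], known))
```

A message made up mostly of random strings would give a ratio close to 0,
while one built from familiar vocabulary would give a ratio close to 1.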

A rule can be created, much like BAYES_* rules, which assigns different scores
based on this ratio.

I ran preliminary tests on a few messages that slipped through today (all of
them had gibberish words tacked on).  The ratios for these messages varied
between 0.1 and 0.2.  Normal messages, and "normal" spam messages, on the
other hand, have ratios from 0.8 to 0.99.

This rule should not assign any negative scores even if all of the tokens
were seen before (obviously, you don't want to give bonus points
to a spam message which you previously learnt).
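One way such a tiered, never-negative rule might look is sketched below in
Python.  The cut-offs and scores in TIERS are made up for illustration and are
not tuned values or anything taken from SpamAssassin; the function name
unseen_token_score is likewise hypothetical.

```python
# Hypothetical (upper bound on ratio, score) tiers, lowest ratio first.
# Neither the cut-offs nor the scores come from SpamAssassin; they only
# illustrate the shape of a BAYES_*-like tiered rule.
TIERS = [(0.2, 2.0), (0.4, 1.0), (0.6, 0.5)]


def unseen_token_score(ratio):
    """Map a seen-token ratio in [0, 1] to a score that is never negative.

    A ratio of None (assumed here to mean "no tokens in the message")
    scores 0.0, as does any ratio at or above the highest cut-off.
    """
    if ratio is None:
        return 0.0
    for upper_bound, score in TIERS:
        if ratio < upper_bound:
            return score
    # Mostly or entirely familiar tokens: no penalty, but no bonus either.
    return 0.0


if __name__ == "__main__":
    for ratio in (0.15, 0.5, 0.9, 1.0):
        print(ratio, unseen_token_score(ratio))
```

Because the fallback score is 0.0 rather than a negative number, a spam
message whose tokens have all been learnt already gains nothing from the rule.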

Any thoughts on this?



