I'm putting up a demo/prototype of some new techniques I'm building
for datamining and analysis.

This tool scans two large corpora of email (500 MB or more) to identify
any substrings that occur frequently in one but infrequently in the
other. You choose the thresholds for 'frequently' and 'infrequently',
and it reports every substring that meets them.
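
As a rough illustration of the idea only, and not the prototype's actual
implementation, here is a naive Python sketch. The function names, the
fixed substring length n, and the min_a/max_b thresholds are all
assumptions made for the example. It holds every substring in memory, so
it is only practical on small samples, not on corpora of the size
mentioned here.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every length-n substring of text, using overlapping windows."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def distinctive_substrings(corpus_a, corpus_b, n=16, min_a=100, max_b=3):
    """Return (count_a, count_b, substring) tuples, highest count_a first.

    A substring is reported when it occurs at least min_a times in
    corpus_a and at most max_b times in corpus_b.
    """
    counts_a = ngram_counts(corpus_a, n)
    counts_b = ngram_counts(corpus_b, n)
    hits = [
        (count_a, counts_b.get(sub, 0), sub)
        for sub, count_a in counts_a.items()
        if count_a >= min_a and counts_b.get(sub, 0) <= max_b
    ]
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Tiny hypothetical corpora, just to exercise the function.
    spam = "X-Mailer: FoxMail\n" * 5
    ham = "X-Mailer: Mutt\n" * 5
    for count_a, count_b, sub in distinctive_substrings(
            spam, ham, n=8, min_a=5, max_b=0):
        print(count_a, count_b, repr(sub))
```

The two counts and the substring are returned in the same column order as
the sample output further down.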

To use, please see my webpage on this work at:

    http://www.cs.rice.edu/~scrosby/datamining/

I'd suggest using this program as inspiration for new rules. If you have
a gob of email and you want to know what is unique about it, this can
help find some candidates. I've used it to compare caught spam against
missed spam, and ham against spam.

Some ideas for using it:

  1. Run two full corpora through the program.
  2. Run just the headers of two corpora through the program.
  3. Run just a particular header, such as 'X-Mailer', through the program.
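
For ideas 2 and 3, one way to prepare the input is to reduce each message
to its headers before feeding it in. The sketch below uses Python's
standard email package. The helper names are made up for the example, and
it assumes you already have each message as a separate raw string; how
you split your folders into messages is up to you.

```python
import email


def headers_only(raw_message):
    """Return just the header lines of one raw message, as text."""
    msg = email.message_from_string(raw_message)
    return "".join(f"{name}: {value}\n" for name, value in msg.items())


def single_header(raw_message, name="X-Mailer"):
    """Return every occurrence of one named header, one per line."""
    msg = email.message_from_string(raw_message)
    return "".join(f"{name}: {value}\n" for value in msg.get_all(name, []))


if __name__ == "__main__":
    # Hypothetical message, for illustration only.
    raw = ("From: a@example.com\n"
           "X-Mailer: FoxMail 4.2 [cn]\n"
           "Subject: hi\n"
           "\n"
           "body text\n")
    print(headers_only(raw), end="")
    print(single_header(raw), end="")
```

Concatenating the result for every message in a corpus gives a
headers-only (or single-header) corpus to run through the program.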

I cannot make good use of this prototype myself, because it immediately
finds the spoor of SA all over the place: in the folder classification,
in the SA headers, and even in the artificial Received line that SA adds
when it encapsulates a message. So for now a clean corpus is absolutely
critical, and I do not have one and cannot build one. Also, this program
is unaware of email boundaries, so a particular HTML element is counted
as many times as it occurs, not once per message in which it occurs. It
may be easier to use with HTML removed. I hope to fix these problems in
a future version.
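
To make the email-boundary point concrete, here is a small Python sketch
of counting per message instead of per occurrence. It is not something
the prototype does today. The function name and the fixed substring
length n are assumptions for the example, and it expects the corpus to be
already split into one string per message.

```python
from collections import Counter


def message_frequency(messages, n):
    """For each length-n substring, count how many messages contain it."""
    freq = Counter()
    for text in messages:
        # Building a set first means repeats inside one message count once.
        freq.update({text[i:i + n] for i in range(len(text) - n + 1)})
    return freq


if __name__ == "__main__":
    # Hypothetical messages: the first repeats "<td>" three times.
    msgs = ["<td><td><td>", "<td>hello", "plain text"]
    freq = message_frequency(msgs, 4)
    print(freq["<td>"])  # number of messages containing "<td>"
```

With this kind of counting, a table-heavy HTML message contributes one
vote for each substring, however often the element repeats inside it.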

Samples of the output include:

(in headers only)
        1110    3       800\nX-Priority:
        1108    3       0800\nX-Priority
        1107    3       +0800\nX-Priorit
        1106    3        +0800\nX-Priori

Timezone might be a good Bayes token ^^^

        402     0       -Mailer: FoxMai
        402     0       X-Mailer: FoxMa
        402     0       Mailer: FoxMail

Ratware? ^^^

        820     2       iority: 3\nX-Mai

X-Priority: 3 header?


        2155    8       y=\"----------=_
        2154    8       ry=\"----------=

I don't get much MIME mail other than spam, so this is probably a spam MIME boundary.

        194     0       m (unknown [61.

Part of a popular faked Received line? Dunno. ^^^

        121     0       2919.6900 DM\nMI

Portion of a particular Outlook version line followed by a MIME header. ^^^


        85      0       essage-Id: <000
        75      0       X-Priority: 4\nX

X-Priority = number? ^^^ 

        162     2       : 3\nX-Library: 
        163     3       lain\nX-Priority
        163     3       plain\nX-Priorit
        163     3       xt/plain\nX-Prio
        227     0        [61.51.
        227     0       n [61.51
        2160    9       ------=_
        143     0       0000\nMessage-Id


(in headers and body)


        3660    0       \161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161
        1983    0       $$$$$$$$$$$$$$$$$$$$
        1571    0        face=\"\183\194\203\206_GB2312\">
Asian ^^^^

        1824    0       http://love.elong.co


        28      1       looking statements, 
        28      1       your prompt response
        28      1       ve hundred thousand 
        29      1       : Foxmail 4.2 [cn]\nM
        29      1       -looking statements,
        29      1       of this transaction 

^^^ The clean-corpus hit for each of these Nigerian spam phrases is a
false negative I didn't remove from my clean corpus. Note the myriad
phrases that are repeated in all 38 of these emails.

        44      2       how to stop further 
        30      1       in this transaction.
        52      2       in\nX-Priority: 3\nX-M
        35      1       '; mso-bidi-font-siz
        37      1       excellent opportunit
        37      1       for you to participa
        37      1       is an\nexcellent oppo
        37      1       ntinuing with this e
        37      1       formation will help 
        37      1       pportunity for you\nt
        37      1       understand that I ca
        37      1       r you to participate
        37      1        we have developed a
        39      1       formation on mortgag
        39      1       1001.lunchboxx.net>\n
        41      1        <TR>\n    <TD>\n     
        41      1        coupons, discounts 
        42      1       000 \n            siz
        43      1       -Type: MULTIPART/alt

^^^^ Capitalized MULTIPART


If you find this useful, please send me a heads-up. 

Scott


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
