I'm putting up a demo/prototype of some new techniques I'm building for data mining and analysis.
This tool scans two large corpora (500 MB or more of email each) to identify any substrings that occur frequently in one but infrequently in the other. You can choose the limits for 'frequently' and 'infrequently'. It then reports all such substrings.

To use it, please see my webpage on this work at:

  http://www.cs.rice.edu/~scrosby/datamining/

I'd suggest using this program as inspiration for new rules. If you have a gob of email and you want to know what is unique about it, this can help find some suggestions. I've used it to look at the difference between caught spam and missed spam, and between ham and spam.

Some ideas for using it:

 1. Run two full corpora through the program.
 2. Run just the headers of two corpora through the program.
 3. Run just a particular header, such as 'X-Mailer', through the program.

I cannot use this prototype myself because it immediately finds the spoor of SA all over the place: in the folder classification, in the SA headers, and even in the artificial Received line that SA adds when it encapsulates a message. So for now, a clean corpus is absolutely critical, and I do not have one and cannot build one.

Also, this program is unaware of email boundaries, so a particular HTML element will be counted as many times as it occurs, not by the number of messages in which it occurs. It may be easier to use with HTML removed.

I hope to remove these problems in the future.

Samples of the output include:

(in headers only)

  1110  3  800\nX-Priority:
  1108  3  0800\nX-Priority
  1107  3  +0800\nX-Priorit
  1106  3  +0800\nX-Priori
Timezone might be a good Bayes token ^^^

   402  0  -Mailer: FoxMai
   402  0  X-Mailer: FoxMa
   402  0  Mailer: FoxMail
Ratware? ^^^

   820  2  iority: 3\nX-Mai
X-Priority: 3 header?

  2155  8  y=\"----------=_
  2154  8  ry=\"----------=
I don't get much MIME except spam, so this is probably that.

   194  0  m (unknown [61.
Part of a popular faked Received line? Dunno. ^^^

   121  0  2919.6900 DM\nMI
Portion of a particular Outlook version line followed by a MIME header. ^^^

    85  0  essage-Id: <000
    75  0  X-Priority: 4\nX
X-Priority = number? ^^^

   162  2  : 3\nX-Library:
   163  3  lain\nX-Priority
   163  3  plain\nX-Priorit
   163  3  xt/plain\nX-Prio
   227  0  [61.51.
   227  0  n [61.51
  2160  9  ------=_
   143  0  0000\nMessage-Id

(in header&body)

  3660  0  \161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161\161
  1983  0  $$$$$$$$$$$$$$$$$$$$
  1571  0  face=\"\183\194\203\206_GB2312\">
Asian ^^^^

  1824  0  http://love.elong.co
    28  1  looking statements,
    28  1  your prompt response
    28  1  ve hundred thousand
    29  1  : Foxmail 4.2 [cn]\nM
    29  1  -looking statements,
    29  1  of this transaction
^^^ The hits for these Nigerian spams were from a false negative I didn't remove from my clean corpus. Note the myriad phrases that are repeated in all 38 of these emails.

    44  2  how to stop further
    30  1  in this transaction.
    52  2  in\nX-Priority: 3\nX-M
    35  1  '; mso-bidi-font-siz
    37  1  excellent opportunit
    37  1  for you to participa
    37  1  is an\nexcellent oppo
    37  1  ntinuing with this e
    37  1  formation will help
    37  1  pportunity for you\nt
    37  1  understand that I ca
    37  1  r you to participate
    37  1  we have developed a
    39  1  formation on mortgag
    39  1  1001.lunchboxx.net>\n
    41  1  <TR>\n <TD>\n
    41  1  coupons, discounts
    42  1  000 \n siz
    43  1  -Type: MULTIPART/alt
^^^^ Capitalized MULTIPART

If you find this useful, please send me a heads-up.

Scott

_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
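
For anyone who wants a feel for the comparison described in this post, here is a minimal Python sketch of the general idea: count substrings in two corpora and report those that meet a "frequent" limit in one and an "infrequent" limit in the other. This is not the prototype's code and says nothing about how the prototype is implemented. It is a naive stand-in that only looks at fixed-length windows and keeps every count in memory, so it is only suitable for small inputs, nowhere near 500 MB. The names count_substrings and distinctive_substrings and the parameters length, min_a and max_b are all made up for illustration.

```python
from collections import Counter


def count_substrings(text, length):
    """Count every substring of exactly `length` characters in `text`."""
    return Counter(text[i:i + length] for i in range(len(text) - length + 1))


def distinctive_substrings(corpus_a, corpus_b, length=15, min_a=2, max_b=0):
    """Return (count_in_a, count_in_b, substring) tuples for substrings that
    occur at least `min_a` times in corpus_a and at most `max_b` times in
    corpus_b. Assumption: a fixed window length, chosen by the caller.
    """
    counts_a = count_substrings(corpus_a, length)
    counts_b = count_substrings(corpus_b, length)
    hits = [
        (count, counts_b[sub], sub)  # a Counter returns 0 for unseen keys
        for sub, count in counts_a.items()
        if count >= min_a and counts_b[sub] <= max_b
    ]
    # Most frequent first; ties broken by the substring itself.
    hits.sort(key=lambda hit: (-hit[0], hit[2]))
    return hits


if __name__ == "__main__":
    # Tiny made-up corpora, only to show the calling convention.
    spam = "X-Mailer: FoxMail\n" * 3
    ham = "X-Mailer: Mutt\n"
    for count_a, count_b, sub in distinctive_substrings(
            spam, ham, length=8, min_a=3, max_b=0):
        print(count_a, count_b, repr(sub))
```

Like the prototype as described, this sketch counts occurrences rather than messages, so a string repeated many times inside one email is counted each time.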