John GALLET writes:
> Hi,
>
> > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> > the patterns; you can then write rules based on these.
>
> I did so, the results are interesting, though I do not really know where
> to go from there. If I take the first 50 "best" patterns and strip off the
> obvious stand-alone words and sure-to-be-false-positive expressions, here
> is what I get to: (sorry for non French speakers, explanation below)
>
>   RATIO   SPAM%    HAM%   DATA
>   1.000   9.375   0.000  /Pour ne plus recevoir /
>   1.000   6.875   0.000  /6 janvier 1978 relative /
>   1.000   6.875   0.000  /affiche pas correctement, vous pouvez le visualiser en/
>   1.000   5.625   0.000  /s données nominatives /
>   1.000   5.625   0.000  / ce message, cliquez-ici/
>   1.000   5.625   0.000  / vous désinscrire de /
>   1.000   5.000   0.000  /Conformément à l/
>   1.000   5.000   0.000  / plus recevoir d\'informations de notre part/
>   1.000   5.000   0.000  /un droit d\'accès/
>   1.000   4.375   0.000  /ment Ã| l\'article 34 de la loi/
>   1.000   4.375   0.000  /ment à l\'article 34 de la loi /
>   1.000   3.750   0.000  /ous désinscrire de notre /
>   1.000   3.750   0.000  /es nominatives vous concernant\. /
>   1.000   3.750   0.000  / Libertés du 6 /
>   1.000   3.750   0.000  /es vous concernant\. Pour l\'exercer, /
>
> As you can see, charset encoding makes a mess, and many must be regrouped.
> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
> read this mail in html, click here).

It might be worth collecting more ham that includes any such common
text -- or even _generating_ mails along those lines (just edit the
message body to include the text you want the ruleset to avoid. ;)

> The whole result is available at
> http://www.saphirtech.fr/spam/seekrules_fr_1.txt
>
> > http://taint.org/x/2008/seekrules_run
>
> I also adapted this one (paths of course, but also forced "mbox" format,
> "detect" spit out zero results)

ah. forgot to mention: detect only treats files that end in ".mbox" as
mboxes. ;)

> , but the result is even less "readable"
> for me. I miss the script seekrules/kill_bad_patterns which I presume
> removes stand alone words and such things.

yes, I left that out. it's very specific to my spamtraps, since it removes
noise added by some of them.

> Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
>
> John

Thanks for trying it out!

--j.
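
PS: on regrouping the patterns that differ only by charset, one possible approach is to make every accented character match one or two arbitrary bytes, so the same pattern hits both the Latin-1 and the UTF-8 form of a phrase. The following is a rough Python sketch of that idea only. The helper name `tolerant_pattern` is made up for illustration and is not part of seek-phrases-in-corpus or SpamAssassin, and whether the resulting pattern suits a real ruleset is an assumption to check against your own corpora.

```python
import re


def tolerant_pattern(phrase: str) -> bytes:
    """Build a byte regex that matches `phrase` in Latin-1 or UTF-8.

    Hypothetical helper: ASCII characters are escaped literally, while
    each non-ASCII character becomes `.{1,2}` (one byte in Latin-1,
    two bytes in UTF-8 for accented French letters).
    """
    parts = []
    for ch in phrase:
        if ord(ch) < 128:
            parts.append(re.escape(ch).encode("ascii"))
        else:
            parts.append(b".{1,2}")
    return b"".join(parts)


# One pattern standing in for the two charset variants of the same phrase.
pattern = re.compile(tolerant_pattern("ment à l'article 34 de la loi"), re.DOTALL)

sample = "Conformément à l'article 34 de la loi"
for encoding in ("utf-8", "latin-1"):
    print(encoding, bool(pattern.search(sample.encode(encoding))))
```

This does not cover doubly-mangled text such as the "Ã|" variant in the list, which expands to more than two bytes; those would still need their own pattern or a wider wildcard, at the cost of more false positives.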