John GALLET writes:
> Hi,
> 
> > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> > the patterns; you can then write rules based on these.
> 
> I did so, the results are interesting, though I do not really know where 
> to go from there. If I take the first 50 "best" patterns and strip off the 
> obvious stand-alone words and sure-to-be-false-positive expressions, here 
> is what I get to: (sorry for non French speakers, explanation below)
> 
>   RATIO   SPAM%    HAM%   DATA
>   1.000   9.375   0.000  /Pour ne plus recevoir /
>   1.000   6.875   0.000  /6 janvier 1978 relative /
>   1.000   6.875   0.000  /affiche pas correctement, vous pouvez le visualiser en/
>   1.000   5.625   0.000  /s données nominatives /
>   1.000   5.625   0.000  / ce message, cliquez-ici/
>   1.000   5.625   0.000  / vous désinscrire de /
>   1.000   5.000   0.000  /Conformément à l/
>   1.000   5.000   0.000  / plus recevoir d\'informations de notre part/
>   1.000   5.000   0.000  /un droit d\'accès/
>   1.000   4.375   0.000  /ment Ã|  l\'article 34 de la loi/
>   1.000   4.375   0.000  /ment à l\'article 34 de la loi /
>   1.000   3.750   0.000  /ous désinscrire de notre /
>   1.000   3.750   0.000  /es nominatives vous concernant\. /
>   1.000   3.750   0.000  / Libertés du 6 /
>   1.000   3.750   0.000  /es vous concernant\. Pour l\'exercer, /
> 
> As you can see, charset encoding makes a mess, and many must be regrouped.
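
One way to do that regrouping, as a rough sketch only: it assumes the damage is the common "UTF-8 bytes decoded as Latin-1" kind, which may not cover every variant in your output. The function names are made up for illustration and are not part of the seekrules scripts.

```python
import re
from collections import defaultdict


def repair_mojibake(text):
    """Undo UTF-8 bytes that were decoded as Latin-1 (e.g. 'Ã©' -> 'é')."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already clean, or damaged in some other way: leave it alone.
        return text


def normalise(pattern):
    """Canonical form used only as a grouping key, not as the final rule."""
    text = repair_mojibake(pattern)
    text = text.replace("\\", "")     # drop regex escapes such as \' and \.
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip().casefold()


def group_patterns(patterns):
    """Map each canonical form to the raw patterns that reduce to it."""
    groups = defaultdict(list)
    for raw in patterns:
        groups[normalise(raw)].append(raw)
    return dict(groups)
```

Each group can then be turned into a single rule by hand, with a character class or alternation wherever the encodings differ.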

> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and 
> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't 
> read this mail in html, click here).

It might be worth collecting more ham that includes any such common
text -- or even _generating_ mails along those lines: just edit the
message body to include the text you want the ruleset to avoid. ;)
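
If you go the generating route, something like this would do as a starting point. It is a minimal sketch using only the Python standard library; the addresses, subject and body wording are placeholders I made up, and real ham is always better than synthetic ham.

```python
import mailbox
from email.message import EmailMessage


def make_ham(phrase, sender="alice@example.org", recipient="bob@example.org"):
    """Build a plain-text message whose body contains the given phrase."""
    msg = EmailMessage()
    msg["From"] = sender          # placeholder address
    msg["To"] = recipient         # placeholder address
    msg["Subject"] = "Newsletter"
    msg.set_content("Bonjour,\n\n" + phrase + "\n\nCordialement\n")
    return msg


def append_to_mbox(path, messages):
    """Append messages to an mbox file, creating the file if needed."""
    box = mailbox.mbox(path)
    try:
        for msg in messages:
            box.add(msg)
        box.flush()
    finally:
        box.close()
```

Calling append_to_mbox("ham_fr.mbox", [make_ham("Pour ne plus recevoir ")]) would add one such message to the ham corpus; the file name is only an example.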

> The whole result is available at 
> http://www.saphirtech.fr/spam/seekrules_fr_1.txt
> 
> >  http://taint.org/x/2008/seekrules_run
> 
> I also adapted this one (paths of course, but also forced "mbox" format, 
> "detect" spit out zero results)

Ah, I forgot to mention: "detect" only treats files whose names end in
".mbox" as mboxes. ;)
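
So the other way around this, instead of forcing the format, is to give the corpus files a ".mbox" name. A small sketch of that; the helper name is hypothetical and it copies rather than renames, so the original file stays put:

```python
import shutil
from pathlib import Path


def with_mbox_suffix(path):
    """Return a path ending in '.mbox', copying the file if necessary."""
    src = Path(path)
    if src.suffix == ".mbox":
        return src                      # already named as an mbox
    dst = src.with_name(src.name + ".mbox")
    shutil.copyfile(src, dst)           # keep the original untouched
    return dst
```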

> , but the result is even less "readable" 
> for me. I miss the script seekrules/kill_bad_patterns which I presume 
> removes stand alone words and such things.

yes, I left that out.  it's very specific to my spamtraps, since it
removes noise added by some of them.

> Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
> 
> John

Thanks for trying it out!

--j.
