Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-27 Thread mouss

John GALLET wrote:
> Re,
>
> >> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
> >> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
> >> read this mail in html, click here).
> >
> > It might be worth collecting more ham that includes any such common
> > text -- or even _generating_ mails along those lines (just edit the
> > message body to include the text you want the ruleset to avoid. ;)
>
> Well, that's the whole point: can we conclude that an email with an
> unsubscribe link tends to be a spam more often than a ham ? I consider
> so, but with a low score. Can we conclude that an email citing the
> French Law "informatique et libertés" is a spam ? I would say "100%
> except government sponsored mailing lists that may feel obliged to do
> so", so I added a higher score. Now it might perfectly be faulty
> logic, I do not have any experience in spam fighting.


many mailing lists and safe newsletters contain such links. examples:
- mailing lists hosted by ovh
- HSC newsletter (Herve is not the kind of guy to participate in spam)
- Ciel (if you junk this, your accountants may junk your salary :)
- Air France (I want my tickets!)
...


same goes for "legal" stuff (nobody wants to miss his Air France 
electronic tickets...)


Things get even worse when ads are included in important mail. here is an
excerpt from a mail from SNCF (confirmation):


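Pas encore membre ?
Inscrivez-vous dès aujourd'hui et gagnez déjà 100 Maximiles de bienvenue !
Pour en savoir plus, cliquez ici
[English: Not a member yet? Sign up today and already earn 100 welcome
Maximiles! To find out more, click here]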





and French members may remember that Maximiles participated in the infamous
"Sarkozy spam". but apparently, they have cleaned up since then (the address
I use at sncf is [EMAIL PROTECTED], so I wouldn't miss it if they use it!).



here is an excerpt from a safe (and actually relatively closed) newsletter.

Non, ceci n'est pas un SPAM, c'est la lettre d'information de
...
Si vous désirez vous désabonner, ...
[English: No, this is not SPAM, it is the newsletter of ... If you wish to
unsubscribe, ...]


(of course, you can argue that a message may not be a "SPAM" because you can't eat an email. but let's not be too pedantic :-p). 



I did not run your rules on my corpus. I'll try to do so but my spam corpus is 
not classified by language.












Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread Justin Mason

John Wilcock writes:
> Justin Mason wrote:
> > John GALLET writes:
> >> Well, thanks for writing it. I think its main weak point for French and 
> >> other accented languages is handling the different encodings for a same 
> >> char with an accent, some kind of "synonyms" list. The same letter, say "a 
> >> with an accent", can be misspelled with a plain "a", encoded in various 
> >> charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left & 
> >> and ; out). I do not know if it is possible at all, it might complicate 
> >> things *a lot*.
> > 
> > The tool can take care of this -- it will replace mutating single-characters
> > with a /./.  It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
> > "any" patterns.
> 
> If the number of permutations is small (as would be the case for
> accented letters and the equivalent unaccented ones, or for that matter
> obfuscation with lookalike characters), wouldn't it be better for it to
> replace the character by a [] list of those permutations (i.e. replace
> something that mutates between e and é with [eé] or replace obfuscation
> of i with l and 1 by [il1])?

It would be, but fixing the pattern-discovery algorithm to discover this
in a relatively speedy way is not so easy.  Patches accepted ;)
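
A minimal Python sketch of the character-class idea discussed above, assuming
the variants have already been found and aligned to the same length (which is
exactly the hard part mentioned here). The names merge_variants and
max_class_size are hypothetical, chosen for illustration; this is not how
seek-phrases-in-corpus works internally.

```python
import re


def merge_variants(variants, max_class_size=4):
    """Merge equal-length strings into a single regex.

    Positions where all variants agree stay literal. Positions that differ
    become a [...] class when only a few distinct characters occur there,
    and fall back to '.' otherwise.
    """
    if len({len(v) for v in variants}) != 1:
        raise ValueError("all variants must have the same length")
    parts = []
    for chars in zip(*variants):
        distinct = sorted(set(chars))
        if len(distinct) == 1:
            # Every variant has the same character here: keep it literal.
            parts.append(re.escape(distinct[0]))
        elif len(distinct) <= max_class_size:
            # Small set of alternatives: a tight character class.
            parts.append("[" + "".join(re.escape(c) for c in distinct) + "]")
        else:
            # Too many alternatives: give up and match any character.
            parts.append(".")
    return "".join(parts)


# Accented vs. unaccented spelling, and i/l/1 lookalike obfuscation.
print(merge_variants(["désinscrire", "desinscrire"]))
print(merge_variants(["click", "cl1ck", "cllck"]))
```

The class keeps the rule tighter than /./ would, so a string that differs in
some unrelated character at that position no longer matches.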


Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread John Wilcock

Justin Mason wrote:
> John GALLET writes:
>> Well, thanks for writing it. I think its main weak point for French and
>> other accented languages is handling the different encodings for a same
>> char with an accent, some kind of "synonyms" list. The same letter, say "a
>> with an accent", can be misspelled with a plain "a", encoded in various
>> charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left &
>> and ; out). I do not know if it is possible at all, it might complicate
>> things *a lot*.
>
> The tool can take care of this -- it will replace mutating single-characters
> with a /./.  It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
> "any" patterns.

If the number of permutations is small (as would be the case for
accented letters and the equivalent unaccented ones, or for that matter
obfuscation with lookalike characters), wouldn't it be better for it to
replace the character by a [] list of those permutations (i.e. replace
something that mutates between e and é with [eé] or replace obfuscation
of i with l and 1 by [il1])?

John.




Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread Justin Mason

John GALLET writes:
> Re,
> 
> >> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
> >> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
> >> read this mail in html, click here).
> >
> > It might be worth collecting more ham that includes any such common
> > text -- or even _generating_ mails along those lines (just edit the
> > message body to include the text you want the ruleset to avoid. ;)
> 
> Well, that's the whole point: can we conclude that an email with an 
> unsubscribe link tends to be a spam more often than a ham ? I consider so, 
> but with a low score. Can we conclude that an email citing the French Law 
> "informatique et libertés" is a spam ? I would say "100% except government 
> sponsored mailing lists that may feel obliged to do so", so I added a 
> higher score. Now it might perfectly be faulty logic, I do not have any 
> experience in spam fighting.

Well, with automated rule-set generation I would advise erring on the
side of "no false positives" -- my experience with FPs is that they 
may appear to be infrequent in one corpus, and then be 10x as frequent
in another person's corpus, just due to the kind of ham he/she gets.

> >> I also adapted this one (paths of course, but also forced "mbox" format,
> >> "detect" spit out zero results)
> > ah.  forgot to mention: detect only treats files that end in ".mbox" as
> > mboxes. ;)
> 
> :-) ok, well anyway it was quite easy to find out since it worked well 
> when forcing and not at all in automatic.
> 
> > Thanks for trying it out!
> 
> Well, thanks for writing it. I think its main weak point for French and 
> other accented languages is handling the different encodings for a same 
> char with an accent, some kind of "synonyms" list. The same letter, say "a 
> with an accent", can be misspelled with a plain "a", encoded in various 
> charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left & 
> and ; out). I do not know if it is possible at all, it might complicate 
> things *a lot*.

The tool can take care of this -- it will replace mutating single-characters
with a /./.  It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
"any" patterns.

--j.


Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread John GALLET

Re,


>> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
>> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
>> read this mail in html, click here).
>
> It might be worth collecting more ham that includes any such common
> text -- or even _generating_ mails along those lines (just edit the
> message body to include the text you want the ruleset to avoid. ;)

Well, that's the whole point: can we conclude that an email with an
unsubscribe link tends to be a spam more often than a ham ? I consider so,
but with a low score. Can we conclude that an email citing the French Law
"informatique et libertés" is a spam ? I would say "100% except government
sponsored mailing lists that may feel obliged to do so", so I added a
higher score. Now it might perfectly be faulty logic, I do not have any
experience in spam fighting.

>> I also adapted this one (paths of course, but also forced "mbox" format,
>> "detect" spit out zero results)
>
> ah.  forgot to mention: detect only treats files that end in ".mbox" as
> mboxes. ;)

:-) ok, well anyway it was quite easy to find out since it worked well
when forcing and not at all in automatic.

> Thanks for trying it out!

Well, thanks for writing it. I think its main weak point for French and
other accented languages is handling the different encodings for a same
char with an accent, some kind of "synonyms" list. The same letter, say "a
with an accent", can be misspelled with a plain "a", encoded in various
charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left &
and ; out). I do not know if it is possible at all, it might complicate
things *a lot*.


a++;
JG
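
One possible way to tame the "same letter, several encodings" problem
described above is to fold every message to a single canonical form before
looking for patterns. The sketch below is an illustration in Python, not part
of the seekrules scripts; decode_bytes and normalize are hypothetical helper
names, and the "try UTF-8, fall back to Latin-1" step is a heuristic that
assumes only those two charsets occur.

```python
import html
import unicodedata


def decode_bytes(raw):
    """Decode a raw body, trying UTF-8 first and falling back to Latin-1.

    Heuristic only: Latin-1 never fails to decode, so it is the last resort.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def normalize(text):
    """Fold HTML entities and accents so spelling variants compare equal."""
    text = html.unescape(text)                  # "&agrave;" becomes "à"
    text = unicodedata.normalize("NFKD", text)  # "à" becomes "a" + combining accent
    # Drop the combining marks, leaving the plain base letters.
    return "".join(c for c in text if not unicodedata.combining(c))


samples = [
    "Conformément à".encode("utf-8"),
    "Conformément à".encode("latin-1"),
    b"Conform&eacute;ment &agrave;",
    b"Conformement a",
]
# All four variants collapse to the same string.
print({normalize(decode_bytes(s)) for s in samples})
```

The trade-off is that rules written against the folded text can no longer
tell "à" from "a", which may or may not matter for a given phrase.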


Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread Justin Mason

John GALLET writes:
> Hi,
> 
> > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> > the patterns; you can then write rules based on these.
> 
> I did so, the results are interesting, though I do not really know where 
> to go from there. If I take the first 50 "best" patterns and strip off the 
> obvious stand-alone words and sure-to-be-false-positive expressions, here 
> is what I get to: (sorry for non French speakers, explanation below)
> 
>   RATIO   SPAM%   HAM%   DATA
>   1.000   9.375   0.000  /Pour ne plus recevoir /
>   1.000   6.875   0.000  /6 janvier 1978 relative /
>   1.000   6.875   0.000  /affiche pas correctement, vous pouvez le visualiser en/
>   1.000   5.625   0.000  /s données nominatives /
>   1.000   5.625   0.000  / ce message, cliquez-ici/
>   1.000   5.625   0.000  / vous désinscrire de /
>   1.000   5.000   0.000  /Conformément à l/
>   1.000   5.000   0.000  / plus recevoir d\'informations de notre part/
>   1.000   5.000   0.000  /un droit d\'accès/
>   1.000   4.375   0.000  /ment Ã|  l\'article 34 de la loi/
>   1.000   4.375   0.000  /ment à l\'article 34 de la loi /
>   1.000   3.750   0.000  /ous désinscrire de notre /
>   1.000   3.750   0.000  /es nominatives vous concernant\. /
>   1.000   3.750   0.000  / Libertés du 6 /
>   1.000   3.750   0.000  /es vous concernant\. Pour l\'exercer, /
> 
> As you can see, charset encoding makes a mess, and many must be regrouped.

> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and 
> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't 
> read this mail in html, click here).

It might be worth collecting more ham that includes any such common
text -- or even _generating_ mails along those lines (just edit the
message body to include the text you want the ruleset to avoid. ;)

> The whole result is available at 
> http://www.saphirtech.fr/spam/seekrules_fr_1.txt
> 
> >  http://taint.org/x/2008/seekrules_run
> 
> I also adapted this one (paths of course, but also forced "mbox" format, 
> "detect" spit out zero results)

ah.  forgot to mention: detect only treats files that end in ".mbox" as 
mboxes. ;)

> , but the result is even less "readable" 
> for me. I miss the script seekrules/kill_bad_patterns which I presume 
> removes stand alone words and such things.

yes, I left that out.  it's very specific to my spamtraps, since it
removes noise added by some of them.

> Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
> 
> John

Thanks for trying it out!

--j.
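
For readers wondering how to read the RATIO, SPAM% and HAM% columns quoted
above, here is a small Python sketch. It assumes SPAM% and HAM% are the
percentage of messages in each corpus that the pattern hits, and that RATIO
is spam% / (spam% + ham%). That formula is an assumption that merely agrees
with the rows shown (1.000 whenever HAM% is 0.000); it is not taken from the
seek-phrases-in-corpus source. hit_rates and the toy corpora are made up for
illustration.

```python
import re


def hit_rates(pattern, spam, ham):
    """Return (ratio, spam_pct, ham_pct) for one pattern.

    spam and ham are non-empty lists of message bodies (plain strings).
    """
    rx = re.compile(pattern)
    spam_pct = 100.0 * sum(1 for body in spam if rx.search(body)) / len(spam)
    ham_pct = 100.0 * sum(1 for body in ham if rx.search(body)) / len(ham)
    total = spam_pct + ham_pct
    # Assumed definition of RATIO; 0.0 when the pattern hits nothing at all.
    ratio = spam_pct / total if total else 0.0
    return ratio, spam_pct, ham_pct


# Invented toy corpora, far too small to mean anything.
spam = [
    "Pour ne plus recevoir nos offres, cliquez-ici",
    "Achetez maintenant",
    "Pour ne plus recevoir ce message, cliquez-ici",
    "Gagnez un voyage",
]
ham = [
    "Bonjour, voici votre billet",
    "Pour ne plus recevoir la lettre d'information, ...",
]

print(hit_rates(r"Pour ne plus recevoir ", spam, ham))
```

With a ham corpus that contains newsletters, the same phrase stops looking
like a perfect spam sign, which is the point about collecting more ham.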