John GALLET writes:
> Re,
> 
> >> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
> >> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
> >> read this mail in html, click here).
> >
> > It might be worth collecting more ham that includes any such common
> > text -- or even _generating_ mails along those lines (just edit the
> > message body to include the text you want the ruleset to avoid. ;)
> 
> Well, that's the whole point: can we conclude that an email with an 
> unsubscribe link is more often spam than ham? I think so, 
> but with a low score. Can we conclude that an email citing the French law 
> "informatique et libertés" is spam? I would say "100% except government 
> sponsored mailing lists that may feel obliged to do so", so I gave it a 
> higher score. Now, this may well be faulty logic; I do not have any 
> experience in spam fighting.

Well, with automated rule-set generation I would advise erring on the
side of "no false positives" -- my experience with FPs is that they 
may appear to be infrequent in one corpus, and then be 10x as frequent
in another person's corpus, just due to the kind of ham he/she gets.

> >> I also adapted this one (paths of course, but also forced "mbox" format,
> >> "detect" spit out zero results)
> > ah.  forgot to mention: detect only treats files that end in ".mbox" as
> > mboxes. ;)
> 
> :-) ok, well anyway it was quite easy to find out since it worked well 
> when forcing and not at all in automatic.
> 
> > Thanks for trying it out!
> 
> Well, thanks for writing it. I think its main weak point for French and 
> other accented languages is handling the different encodings of the same 
> accented character, some kind of "synonyms" list. The same letter, say "a 
> with an accent", can be misspelled as a plain "a", encoded in various 
> charsets (latin, utf-8) as a "normal" à, or HTML-encoded as agrave (I left & 
> and ; out). I do not know if it is possible at all; it might complicate 
> things *a lot*.

The tool can take care of this -- it will replace a single character that
varies between messages with a /./.  It also supports /.?/, /.{0,3}/,
/.{0,10}/ and a few other "any" patterns.

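To illustrate how such "any" patterns absorb the encoding variants
mentioned above, here is a rough sketch in Python rather than the tool's
own code. The phrase variants and the helper name `matches` are made up
for illustration, and the sketch assumes the message text has already
been decoded to characters (on raw UTF-8 bytes an accented letter takes
two bytes, so a single /./ would not cover it).

```python
import re

# Hypothetical variants of one phrase, as different senders might write it.
variants = [
    "informatique et libertés",         # accented character
    "informatique et libertes",         # accent dropped
    "informatique et libert&eacute;s",  # HTML entity
]

# /./ stands in for the one character that varies between messages.
single = re.compile(r"informatique et libert.s")
# /.{0,10}/ is wide enough to also swallow an HTML entity.
wide = re.compile(r"informatique et libert.{0,10}s")

def matches(pattern, texts):
    """Return one boolean per text: does the pattern occur in it?"""
    return [bool(pattern.search(t)) for t in texts]

print(matches(single, variants))
print(matches(wide, variants))
```

The narrow /./ pattern covers the accented and unaccented spellings but
not the HTML entity; the wider /.{0,10}/ pattern covers all three, at the
cost of being looser and so more likely to hit ham.
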
--j.
