Re,

Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
read this mail in html, click here).

It might be worth collecting more ham that includes any such common
text -- or even _generating_ mails along those lines (just edit the
message body to include the text you want the ruleset to avoid. ;)
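
For what it is worth, generating such test ham could look roughly like this in Python. It is only a sketch: the addresses, the subject, the body text and the helper names make_ham and write_mbox are made up for illustration, and the output path is given a ".mbox" suffix on purpose.

```python
import mailbox
from email.message import EmailMessage


def make_ham(phrase: str, index: int) -> EmailMessage:
    """Build a small legitimate-looking mail that contains `phrase`."""
    msg = EmailMessage()
    # Hypothetical addresses and subject, for illustration only.
    msg["From"] = "alice@example.org"
    msg["To"] = "bob@example.org"
    msg["Subject"] = f"Newsletter {index}"
    # The phrase the ruleset should NOT treat as a spam sign goes in the body.
    msg.set_content(f"Bonjour,\n\nVoici les nouvelles du mois.\n\n{phrase}\n")
    return msg


def write_mbox(messages, path: str) -> None:
    """Append the messages to an mbox file at `path`."""
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for msg in messages:
            box.add(msg)
        box.flush()
    finally:
        box.close()
```

Usage would be something like write_mbox([make_ham("Pour vous desabonner, cliquez ici.", i) for i in range(50)], "generated-ham.mbox").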

Well, that's the whole point: can we conclude that an email with an unsubscribe link is more often spam than ham? I think so, but it only deserves a low score. Can we conclude that an email citing the French law "informatique et libertés" is spam? I would say "100%, except for government-sponsored mailing lists that may feel obliged to cite it", so I gave it a higher score. Now, my logic may well be faulty; I do not have any experience in spam fighting.

I also adapted this one (paths of course, but also forced "mbox" format,
"detect" spit out zero results)
ah.  forgot to mention: detect only treats files that end in ".mbox" as
mboxes. ;)

:-) OK, well, anyway it was quite easy to find out, since it worked well when forcing the format and not at all with automatic detection.

Thanks for trying it out!

Well, thanks for writing it. I think its main weak point for French and other accented languages is handling the different encodings of the same accented character; it would need some kind of "synonyms" list. The same letter, say "a with an accent", can be misspelled as a plain "a", encoded in various charsets (Latin, UTF-8) as a "normal" à, or HTML-encoded as agrave (I left the & and ; out). I do not know whether this is possible at all; it might complicate things *a lot*.
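
To make the idea a bit more concrete, here is a rough Python sketch of folding all those variants into a single form before matching. It is not how the tool works today; the function name fold_accents is hypothetical, and guessing UTF-8 first with a Latin-1 fallback is my own assumption.

```python
import html
import unicodedata


def fold_accents(raw: bytes) -> str:
    """Reduce raw bytes to lowercase text with the accents removed."""
    # Assumption: try UTF-8 first, then fall back to Latin-1,
    # which can decode any byte sequence.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")
    # Turn HTML entities (the agrave form) into the real character.
    text = html.unescape(text)
    # Split each accented letter into base letter + combining mark,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


variants = [
    "d\u00e9sabonner".encode("utf-8"),    # UTF-8 bytes
    "d\u00e9sabonner".encode("latin-1"),  # Latin-1 bytes
    b"d&eacute;sabonner",                 # HTML entity
    b"desabonner",                        # accent simply left out
]
folded = {fold_accents(v) for v in variants}
```

With this, all four spellings end up as the same string, so a single pattern written without accents would cover them. The price is that words differing only by an accent can no longer be told apart.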

a++;
JG
