Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-27 Thread mouss

John GALLET wrote:

> Re,
>
> >> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
> >> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
> >> read this mail in html, click here).
> >
> > It might be worth collecting more ham that includes any such common
> > text -- or even _generating_ mails along those lines (just edit the
> > message body to include the text you want the ruleset to avoid. ;)
>
> Well, that's the whole point: can we conclude that an email with an
> unsubscribe link tends to be spam more often than ham? I think so, but
> with a low score. Can we conclude that an email citing the French law
> "informatique et libertés" is spam? I would say "100%, except
> government-sponsored mailing lists that may feel obliged to do so", so
> I gave it a higher score. Now, this may well be faulty logic; I do not
> have any experience in spam fighting.


many mailing lists and safe newsletters contain such links. examples:
- mailing lists hosted by ovh
- HSC newsletter (Herve is not the kind of guy to participate in spam)
- Ciel (if you junk this, your accountants may junk your salary :)
- Air France (I want my tickets!)
...


same goes for "legal" stuff (nobody wants to miss his Air France 
electronic tickets...)


Things get even worse when ads are included in important mail. here is an 
excerpt from a mail from SNCF (confirmation):


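Pas encore membre ?
Inscrivez-vous dès aujourd'hui et gagnez déjà 100 Maximiles de bienvenue !
Pour en savoir plus, cliquez ici 
[Not a member yet? Sign up today and earn 100 welcome Maximiles right away! To find out more, click here]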





and French members may remember that Maximiles took part in the infamous "Sarkozy spam". but apparently they have cleaned up since then (the address I use at sncf is [EMAIL PROTECTED], so I wouldn't miss it if they used it!). 



here is an excerpt from a safe (and actually relatively closed) newsletter.

Non, ceci n'est pas un SPAM, c'est la  lettre d'information de
...
Si vous désirez vous désabonner, ...
[No, this is not SPAM, this is the newsletter of ... If you wish to unsubscribe, ...]


(of course, you can argue that a message may not be a "SPAM" because you can't eat an email. but let's not be too pedantic :-p). 



I did not run your rules on my corpus. I'll try to do so but my spam corpus is 
not classified by language.












Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread Justin Mason

John Wilcock writes:
> Justin Mason wrote:
> > John GALLET writes:
> >> Well, thanks for writing it. I think its main weak point for French and
> >> other accented languages is handling the different encodings of the same
> >> accented character, some kind of "synonyms" list. The same letter, say "a
> >> with an accent", can be misspelled as a plain "a", encoded in various
> >> charsets (latin, utf-8) as a "normal" à, or HTML-encoded as agrave (I left
> >> & and ; out). I do not know if it is possible at all; it might complicate
> >> things *a lot*.
> > 
> > The tool can take care of this -- it will replace mutating single-characters
> > with a /./.  It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
> > "any" patterns.
> 
> If the number of permutations is small (as would be the case for
> accented letters and the equivalent unaccented ones, or for that matter
> obfuscation with lookalike characters), wouldn't it be better for it to
> replace the character by a [] list of those permutations (e.g. replace
> something that mutates between e and é with [eé], or replace obfuscation
> of i with l and 1 by [il1])?

It would be, but fixing the pattern-discovery algorithm to discover this
in a relatively speedy way is not so easy.  Patches accepted ;)
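
As an illustration only, the [] idea discussed above can be sketched in a few lines of Python. This is not part of seek-phrases-in-corpus and says nothing about how its discovery algorithm works; merge_variants is a hypothetical helper, and it assumes the variant spellings are already known and all have the same length (which is exactly the part that is hard to discover quickly).

```python
import re

def merge_variants(variants):
    """Collapse equal-length spellings into one regex using [] classes.

    Hypothetical helper for illustration; it does not discover the
    variants, it only merges spellings that are handed to it.
    """
    if len({len(v) for v in variants}) != 1:
        raise ValueError("this sketch only handles equal-length variants")
    parts = []
    for chars in zip(*variants):
        seen = sorted(set(chars))
        if len(seen) == 1:
            # Same character in every variant: keep it literally.
            parts.append(re.escape(seen[0]))
        else:
            # Position that mutates: list the observed characters only.
            parts.append("[" + "".join(re.escape(c) for c in seen) + "]")
    return "".join(parts)

accent_pattern = merge_variants(["désinscrire", "desinscrire"])
lookalike_pattern = merge_variants(["viagra", "v1agra", "vlagra"])
print(accent_pattern)
print(lookalike_pattern)
```

Compared with a bare /./, the class only accepts the characters actually seen at that position, so the resulting pattern is less likely to match unrelated ham text.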


Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread John Wilcock

Justin Mason wrote:
> John GALLET writes:
>> Well, thanks for writing it. I think its main weak point for French and
>> other accented languages is handling the different encodings of the same
>> accented character, some kind of "synonyms" list. The same letter, say "a
>> with an accent", can be misspelled as a plain "a", encoded in various
>> charsets (latin, utf-8) as a "normal" à, or HTML-encoded as agrave (I left
>> & and ; out). I do not know if it is possible at all; it might complicate
>> things *a lot*.
>
> The tool can take care of this -- it will replace mutating single-characters
> with a /./.  It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
> "any" patterns.

If the number of permutations is small (as would be the case for
accented letters and the equivalent unaccented ones, or for that matter
obfuscation with lookalike characters), wouldn't it be better for it to
replace the character by a [] list of those permutations (e.g. replace
something that mutates between e and é with [eé], or replace obfuscation
of i with l and 1 by [il1])?

John.

--
-- Over 3000 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages- www.tradoc.fr



Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread Justin Mason

John GALLET writes:
> Re,
> 
> >> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
> >> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
> >> read this mail in html, click here).
> >
> > It might be worth collecting more ham that includes any such common
> > text -- or even _generating_ mails along those lines (just edit the
> > message body to include the text you want the ruleset to avoid. ;)
> 
> Well, that's the whole point: can we conclude that an email with an
> unsubscribe link tends to be spam more often than ham? I think so, but
> with a low score. Can we conclude that an email citing the French law
> "informatique et libertés" is spam? I would say "100%, except
> government-sponsored mailing lists that may feel obliged to do so", so
> I gave it a higher score. Now, this may well be faulty logic; I do not
> have any experience in spam fighting.

Well, with automated rule-set generation I would advise erring on the
side of "no false positives" -- my experience with FPs is that they 
may appear to be infrequent in one corpus, and then be 10x as frequent
in another person's corpus, just due to the kind of ham he/she gets.

> >> I also adapted this one (paths of course, but also forced "mbox" format,
> >> "detect" spit out zero results)
> > ah.  forgot to mention: detect only treats files that end in ".mbox" as
> > mboxes. ;)
> 
> :-) ok, well anyway it was quite easy to find out since it worked well 
> when forcing and not at all in automatic.
> 
> > Thanks for trying it out!
> 
> Well, thanks for writing it. I think its main weak point for French and
> other accented languages is handling the different encodings of the same
> accented character, some kind of "synonyms" list. The same letter, say "a
> with an accent", can be misspelled as a plain "a", encoded in various
> charsets (latin, utf-8) as a "normal" à, or HTML-encoded as agrave (I left
> & and ; out). I do not know if it is possible at all; it might complicate
> things *a lot*.

The tool can take care of this -- it will replace mutating single-characters
with a /./.  It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
"any" patterns.

--j.


Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread John GALLET

Re,


>> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
>> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
>> read this mail in html, click here).
>
> It might be worth collecting more ham that includes any such common
> text -- or even _generating_ mails along those lines (just edit the
> message body to include the text you want the ruleset to avoid. ;)

Well, that's the whole point: can we conclude that an email with an
unsubscribe link tends to be spam more often than ham? I think so, but
with a low score. Can we conclude that an email citing the French law
"informatique et libertés" is spam? I would say "100%, except
government-sponsored mailing lists that may feel obliged to do so", so
I gave it a higher score. Now, this may well be faulty logic; I do not
have any experience in spam fighting.

>> I also adapted this one (paths of course, but also forced "mbox" format,
>> "detect" spit out zero results)
>
> ah.  forgot to mention: detect only treats files that end in ".mbox" as
> mboxes. ;)

:-) ok, well anyway it was quite easy to find out since it worked well
when forcing and not at all in automatic.

> Thanks for trying it out!

Well, thanks for writing it. I think its main weak point for French and
other accented languages is handling the different encodings of the same
accented character, some kind of "synonyms" list. The same letter, say "a
with an accent", can be misspelled as a plain "a", encoded in various
charsets (latin, utf-8) as a "normal" à, or HTML-encoded as agrave (I left
& and ; out). I do not know if it is possible at all; it might complicate
things *a lot*.


a++;
JG


Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread Justin Mason

John GALLET writes:
> Hi,
> 
> > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> > the patterns; you can then write rules based on these.
> 
> I did so, the results are interesting, though I do not really know where 
> to go from there. If I take the first 50 "best" patterns and strip off the 
> obvious stand-alone words and sure-to-be-false-positive expressions, here 
> is what I get to: (sorry for non French speakers, explanation below)
> 
>   RATIO   SPAM%   HAM%    DATA
>   1.000   9.375   0.000  /Pour ne plus recevoir /
>   1.000   6.875   0.000  /6 janvier 1978 relative /
>   1.000   6.875   0.000  /affiche pas correctement, vous pouvez le visualiser en/
>   1.000   5.625   0.000  /s données nominatives /
>   1.000   5.625   0.000  / ce message, cliquez-ici/
>   1.000   5.625   0.000  / vous désinscrire de /
>   1.000   5.000   0.000  /Conformément à l/
>   1.000   5.000   0.000  / plus recevoir d\'informations de notre part/
>   1.000   5.000   0.000  /un droit d\'accès/
>   1.000   4.375   0.000  /ment Ã|  l\'article 34 de la loi/
>   1.000   4.375   0.000  /ment à l\'article 34 de la loi /
>   1.000   3.750   0.000  /ous désinscrire de notre /
>   1.000   3.750   0.000  /es nominatives vous concernant\. /
>   1.000   3.750   0.000  / Libertés du 6 /
>   1.000   3.750   0.000  /es vous concernant\. Pour l\'exercer, /
> 
> As you can see, charset encoding makes a mess, and many must be regrouped.

> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and 
> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't 
> read this mail in html, click here).

It might be worth collecting more ham that includes any such common
text -- or even _generating_ mails along those lines (just edit the
message body to include the text you want the ruleset to avoid. ;)

> The whole result is available at 
> http://www.saphirtech.fr/spam/seekrules_fr_1.txt
> 
> >  http://taint.org/x/2008/seekrules_run
> 
> I also adapted this one (paths of course, but also forced "mbox" format, 
> "detect" spit out zero results)

ah.  forgot to mention: detect only treats files that end in ".mbox" as 
mboxes. ;)

> , but the result is even less "readable" 
> for me. I miss the script seekrules/kill_bad_patterns which I presume 
> removes stand alone words and such things.

yes, I left that out.  it's very specific to my spamtraps, since it
removes noise added by some of them.

> Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
> 
> John

Thanks for trying it out!

--j.


Re: Philosophy for opt-in (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread John Wilcock

John GALLET wrote:
> I think I have a simple newbie problem of philosophy/strategy: my
> approach, for what it's worth, was to flag anything that contains
> unsubscribe links and French law reminders, because all the ones I
> receive are spam anyway, and to add the opt-in mailings/newsletters I
> receive to whitelist_from in user_prefs, i.e. I kill everything except
> those explicitly allowed.


That's a strategy I tried when I first started writing SA rules, but 
soon rejected due to the workload of detecting and whitelisting new 
opt-in subscriptions. It may work for you if you don't have many users 
who sign up for this stuff...


Incidentally, I have a ruleset for French-language "Nigerian" scams 
(which in fact tend to be mostly from Côte d'Ivoire, not Nigeria!) that 
I've been meaning to clean up and make public. I'll try to get round to 
that soon...


John.



Philosophy for opt-in (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread John GALLET

Hi,

>> If these are hit rates with a very minimal daily corpus, I don't know if
>> the present ruleset is ready for production, unless you have 0 tolerance
>> for any bulk, period.
>
> I'm afraid I must agree. I don't have a confirmed and sorted corpus per se,
> but after a single night's live testing with very low scores I can confirm
> that, as I suspected, many of these rules hit genuine opt-in newsletters and
> even things like ebay notifications in French.

Thanks for the feedback. I do not have any ebay subscribers among my users,
except one power-seller who has the ebay thingies whitelisted.

> I will however keep the ruleset live for a while, to see whether the
> online meds and online gambling rules actually hit anything.

They should; they do on my machines. But actually, they are only useful
for "new" spam that has not yet been caught by an RBL. When I wrote them,
it was because spam *was* getting through; now they just push towards
"almost-probably-spam". Another note is that much of this particular spam
is automatically and badly translated (much "pidgin-French", if I can say
so).

> My personal tolerance for bulk mail is pretty low, and in a way I'd love to
> use rules like these, with just a bit of fine tuning - the rules do also hit
> a fair bit of French spam. But unfortunately my users actually want to
> receive their newsletters and even complain if they end up in their spam
> folder.

I think I have a simple newbie problem of philosophy/strategy: my
approach, for what it's worth, was to flag anything that contains
unsubscribe links and French law reminders, because all the ones I
receive are spam anyway, and to add the opt-in mailings/newsletters I
receive to whitelist_from in user_prefs, i.e. I kill everything except
those explicitly allowed.

If that is not the correct approach, I can guarantee you that the rules
as currently written are bad (too harsh), and I need strategy advice on
how to manage opt-in lists.


John



Re: hit frequencies (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread Michael Monnerie
On Tuesday, 24 June 2008 John Wilcock wrote:
> with just a bit of fine tuning

I guess John Gallet needs a bigger corpus; maybe you could share some 
ham/spam with him. He does the work of creating the rules, and with a 
better corpus the rules will get better. I know this: I maintain the 
GERMAN ruleset, and it's hard without any reports from others.

mfg zmi
-- 
// Michael Monnerie, Ing.BSc-  http://it-management.at
// Tel: 0660 / 415 65 31  .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: www.keyserver.net   Key-ID: 1C1209B4




Re: hit frequencies (was Re: [Rule Set proposal] French Rules

2008-06-24 Thread John Wilcock

Yet Another Ninja wrote:
> If these are hit rates with a very minimal daily corpus, I don't know if
> the present ruleset is ready for production, unless you have 0 tolerance
> for any bulk, period.

I'm afraid I must agree. I don't have a confirmed and sorted corpus per 
se, but after a single night's live testing with very low scores I can 
confirm that, as I suspected, many of these rules hit genuine opt-in 
newsletters and even things like ebay notifications in French. I will 
however keep the ruleset live for a while, to see whether the online 
meds and online gambling rules actually hit anything.

My personal tolerance for bulk mail is pretty low, and in a way I'd love 
to use rules like these, with just a bit of fine tuning - the rules do 
also hit a fair bit of French spam. But unfortunately my users actually 
want to receive their newsletters and even complain if they end up in 
their spam folder.


John.



Re: hit frequencies (was Re: [Rule Set proposal] French Rules

2008-06-23 Thread John GALLET

Re,

> I excluded the last two rules from my masscheck to avoid FPs, as these
> ESPs/X-Mailers are definitely grey ("import rcpt list and blast" sort of
> ESPs), not black, for global use.

If you can point me to some more information on how to do that, on-list or 
off-list, I am interested. I am new to this whole business.

In fact I was forced to look at X-Mailer and other strange headers for 
French spam that was still getting through with no real easy keywords, and 
these guys often had the good idea of developing their own "software" 
and being proud of it.

> #counts   FR_SPAMISLEGAL        8s/2h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
> #counts   FR_SPAMISLEGAL_2      5s/2h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
> #counts   FR_NOTSPAM            0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
> #counts   FR_PAYLESSTAXES       0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
> #counts   FR_REALESTATE_INVEST  0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
> #counts   FR_ONLINEGAMBLING     0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
> #counts   FR_ONLINEMEDS         0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
> #counts   FR_REASON_SUBSCRIBE   1s/1h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
> #counts   FR_HOWTOUNSUBSCRIBE   7s/16h  of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
>
> If these are hit rates with a very minimal daily corpus, I don't know if
> the present ruleset is ready for production, unless you have 0 tolerance
> for any bulk, period.


I do subscribe to various mailing lists, and none of them seems compelled 
to remind me how to unsubscribe, even less to quote the law on spam to me.

Even the official government "conseil des ministres" mailing (a summary of 
the daily/weekly/whatever government meeting) does not cite the "loi 
informatique et libertés" anymore (but they do use a company I am getting 
a lot of spam from).

So basically the question is: what makes French spam recognizable?

On the other hand, I am also worried about the very low hit counts of most
rules.

If all your 1166 spams are in French, we can throw the whole ruleset to 
/dev/null (well, I'll keep it for myself anyway).


A++;
JG



seekrules over French spam (was Re: [Rule Set proposal] French Rules

2008-06-23 Thread John GALLET

Hi,


> You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> the patterns; you can then write rules based on these.


I did so, the results are interesting, though I do not really know where 
to go from there. If I take the first 50 "best" patterns and strip off the 
obvious stand-alone words and sure-to-be-false-positive expressions, here 
is what I get to: (sorry for non French speakers, explanation below)


 RATIO   SPAM%   HAM%    DATA
 1.000   9.375   0.000  /Pour ne plus recevoir /
 1.000   6.875   0.000  /6 janvier 1978 relative /
 1.000   6.875   0.000  /affiche pas correctement, vous pouvez le visualiser en/
 1.000   5.625   0.000  /s données nominatives /
 1.000   5.625   0.000  / ce message, cliquez-ici/
 1.000   5.625   0.000  / vous désinscrire de /
 1.000   5.000   0.000  /Conformément à l/
 1.000   5.000   0.000  / plus recevoir d\'informations de notre part/
 1.000   5.000   0.000  /un droit d\'accès/
 1.000   4.375   0.000  /ment Ã|  l\'article 34 de la loi/
 1.000   4.375   0.000  /ment à l\'article 34 de la loi /
 1.000   3.750   0.000  /ous désinscrire de notre /
 1.000   3.750   0.000  /es nominatives vous concernant\. /
 1.000   3.750   0.000  / Libertés du 6 /
 1.000   3.750   0.000  /es vous concernant\. Pour l\'exercer, /

As you can see, charset encoding makes a mess, and many must be regrouped.
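
One possible way to do the regrouping, sketched in Python under stated assumptions: fold every pattern to a comparison key before counting, so that the accented, entity-encoded, unaccented and mis-decoded spellings of one phrase land in the same bucket. The function names are made up for this example, and the mojibake repair is a heuristic that only handles UTF-8 text wrongly read as Latin-1; none of this is what seek-phrases-in-corpus does.

```python
import html
import unicodedata

def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1, if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return text  # not this kind of mojibake: keep the text as-is

def fold(text):
    """Map accented, entity-encoded and mis-decoded spellings to one key."""
    text = html.unescape(text)          # &agrave; -> à
    text = repair_mojibake(text)        # Ã + NBSP -> à
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining accents left over after decomposition.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

variants = [
    "Conformément à l",
    "Conform&eacute;ment &agrave; l",
    "Conformement a l",
    "Conform\u00c3\u00a9ment \u00c3\u00a0 l",  # UTF-8 read as Latin-1
]
keys = {fold(v) for v in variants}
print(keys)
```

All four spellings fold to the same key, so their SPAM% figures could be summed before deciding which patterns are worth a rule.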

Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and 
FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't 
read this mail in html, click here).


The whole result is available at 
http://www.saphirtech.fr/spam/seekrules_fr_1.txt



>  http://taint.org/x/2008/seekrules_run


I also adapted this one (paths of course, but also forced "mbox" format, 
"detect" spit out zero results), but the result is even less "readable" 
for me. I miss the script seekrules/kill_bad_patterns which I presume 
removes stand alone words and such things.


Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt

John

Re: hit frequencies (was Re: [Rule Set proposal] French Rules

2008-06-23 Thread Yet Another Ninja

On 6/23/2008 4:36 PM, John GALLET wrote:

> Hi,
>
> First of all, thanks to Justin for patiently helping me to install
> mass-check and pointing me in the right direction. I will try to run the
> algorithms tonight to see what they come up with.
>
> In the meantime, you can find a hit-frequencies report at:
> http://www.saphirtech.fr/spam/freqs_2008_06_23.txt
>
> All rules are prefixed with FR_ and are available in the same directory.
>
> I must say I did not double check for stray spam in my mailbox before
> using it as a ham corpus but it *should* be clean. I'll double check for
> next run. The spam corpus was 100% French spam, hand-picked over the
> last week through the "probably-spam" class (default score values 5-15).
>
> Any feedback on the results (not enough in corpus, bad rules, good
> rules, etc.) appreciated.


I excluded the last two rules from my masscheck to avoid FPs, as these 
ESPs/X-Mailers are definitely grey ("import rcpt list and blast" sort of 
ESPs), not black, for global use.

#counts   FR_SPAMISLEGAL        8s/2h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts   FR_SPAMISLEGAL_2      5s/2h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts   FR_NOTSPAM            0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts   FR_PAYLESSTAXES       0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts   FR_REALESTATE_INVEST  0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts   FR_ONLINEGAMBLING     0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts   FR_ONLINEMEDS         0s/0h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts   FR_REASON_SUBSCRIBE   1s/1h   of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts   FR_HOWTOUNSUBSCRIBE   7s/16h  of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08

If these are hit rates with a very minimal daily corpus, I don't know if 
the present ruleset is ready for production, unless you have 0 tolerance 
for any bulk, period.
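
To put numbers on the counts above, here is a small Python sketch that turns a "#counts" line into per-corpus hit rates and an S/O-style ratio. It assumes the ratio is spam% / (spam% + ham%), which is consistent with the RATIO column in the seek-phrases output elsewhere in this thread (1.000 whenever HAM% is 0); the regular expression and the function name are made up for this example.

```python
import re

# Matches e.g. "#counts FR_X 8s/2h of 3859 corpus (1166s/2693h AXB-MC1)".
COUNTS_RE = re.compile(
    r"#counts\s+(?P<rule>\S+)\s+(?P<s>\d+)s/(?P<h>\d+)h\s+of\s+\d+\s+corpus\s+"
    r"\((?P<ns>\d+)s/(?P<nh>\d+)h"
)

def hit_rates(line):
    """Return (rule, spam %, ham %, S/O) for one '#counts' line."""
    m = COUNTS_RE.search(line)
    if m is None:
        raise ValueError("not a #counts line: %r" % line)
    spam_pct = 100.0 * int(m.group("s")) / int(m.group("ns"))
    ham_pct = 100.0 * int(m.group("h")) / int(m.group("nh"))
    total = spam_pct + ham_pct
    # S/O is undefined for a rule that never hit anything.
    so = spam_pct / total if total else None
    return m.group("rule"), spam_pct, ham_pct, so

lines = [
    "#counts FR_SPAMISLEGAL 8s/2h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08",
    "#counts FR_HOWTOUNSUBSCRIBE 7s/16h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08",
    "#counts FR_NOTSPAM 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08",
]
results = [hit_rates(line) for line in lines]
for rule, spam_pct, ham_pct, so in results:
    print(rule, round(spam_pct, 3), round(ham_pct, 3), so)
```

On these figures FR_SPAMISLEGAL comes out at roughly 0.69% of spam against 0.07% of ham (S/O about 0.90), while FR_HOWTOUNSUBSCRIBE hits about 0.60% of spam and 0.59% of ham (S/O about 0.50), i.e. it barely separates the two corpora.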





Re: hit frequencies (was Re: [Rule Set proposal] French Rules

2008-06-23 Thread John GALLET
> Thanks for taking this burden upon yourself. One other thing you should be
> prepared to do, if you're willing to take long-term responsibility for these
> rules, is to provide sa-update-compatible feeds of your dynamic rules. This
> is another thing that Justin can probably help you with.

I am happy to try to do so, but I am honestly not worried about the 
feed part; all it boils down to is putting the right file in the right 
place (be it push or pull, ftp or rsync, whatever).

What I am more worried about is testing the rules regularly and, even 
before that, checking that they are valid. They are "good" on my system 
with my users, but then they were custom-tailored to be so.


JG



Re: hit frequencies (was Re: [Rule Set proposal] French Rules

2008-06-23 Thread John GALLET

Re,

> Looking at the rules, I'm worried about false positives on genuine opt-in
> advertising. I have a number of users who choose to receive all kinds of
> advertising blurb,

This is one of the reasons why I did not hunt for "click here" and "if you 
can't see this email in html". Now correct me if I am wrong (ouch, no, not 
on the head), but isn't this what whitelist_from is for? I was never able 
to let the Intel newsletter through (it is in English); it would always be 
caught by SA. The same went for genuine Microsoft Support answers (ok, don't 
laugh).

> so I'll run your rules with very low scores for a while to see what gets
> hit.

You can find a little more information, and exactly this suggestion, at 
http://www.saphirtech.fr/spamassassin.html


JG



Re: hit frequencies (was Re: [Rule Set proposal] French Rules

2008-06-23 Thread John Wilcock

John GALLET wrote:
> Any feedback on the results (not enough in corpus, bad rules, good
> rules, etc.) appreciated.


Looking at the rules, I'm worried about false positives on genuine 
opt-in advertising. I have a number of users who choose to receive all 
kinds of advertising blurb, so I'll run your rules with very low scores 
for a while to see what gets hit.


John.



Re: hit frequencies (was Re: [Rule Set proposal] French Rules

2008-06-23 Thread John Hardin

On Mon, 23 Jun 2008, John GALLET wrote:

> First of all, thanks to Justin for patiently helping me to install
> mass-check and pointing me in the right direction.

Applause for Justin! This is the sort of thing we need to see for many 
more specialized spam categories...

> I will try to run the algorithms tonight to see what they come up with.

Thanks for taking this burden upon yourself. One other thing you should be 
prepared to do, if you're willing to take long-term responsibility for 
these rules, is to provide sa-update-compatible feeds of your dynamic 
rules. This is another thing that Justin can probably help you with.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174     pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The problem is when people look at Yahoo, slashdot, or groklaw and
  jump from obvious and correct observations like "Oh my God, this
  place is teeming with utter morons" to incorrect conclusions like
  "there's nothing of value here".        -- Al Petrofsky, in Y! SCOX
---
 11 days until the 232nd anniversary of the Declaration of Independence


hit frequencies (was Re: [Rule Set proposal] French Rules

2008-06-23 Thread John GALLET

Hi,

First of all, thanks to Justin for patiently helping me to install 
mass-check and pointing me in the right direction. I will try to run the 
algorithms tonight to see what they come up with.


In the meantime, you can find a hit-frequencies report at:
http://www.saphirtech.fr/spam/freqs_2008_06_23.txt

All rules are prefixed with FR_ and are available in the same directory.

I must say I did not double check for stray spam in my mailbox before 
using it as a ham corpus but it *should* be clean. I'll double check for 
next run. The spam corpus was 100% French spam, hand-picked over the last 
week through the "probably-spam" class (default score values 5-15).


Any feedback on the results (not enough in corpus, bad rules, good rules, 
etc.) appreciated.


Sincerely,
JG



Re: [Rule Set proposal] French Rules

2008-06-19 Thread Justin Mason

Giampaolo Tomassoni writes:
> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, June 19, 2008 5:49 PM
> > To: Giampaolo Tomassoni
> > Cc: [EMAIL PROTECTED]; users@spamassassin.apache.org
> > Subject: Re: [Rule Set proposal] French Rules
> > 
> > ...omissis...
> >
> 
> Ok, I see I have to get a copy of some reference mass-check: mine is mostly
> in Italian and I'm getting a lot of stuff which could easily result in FPs.
> See:
> 
> #  1.000   6.655   0.000
> body SEEK_OKRP_V  /We/
> #  1.000   4.292   0.000
> body SEEK_ZHYXLF  / Redmond, WA /
> #  1.000   4.292   0.000
> body SEEK_EFMKIR  /Microsoft/
> #  1.000   4.040   0.000
> body SEEK_V__XNS  /Get/
> #  1.000   3.841   0.000
> body SEEK_EXHMOF  /This/

yeah, you'll need to ensure your ham corpus contains lots of both English
_and_ Italian text ;)
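
To see why words like /We/ and /This/ come out as "spam-only" with an Italian ham corpus, here is a toy Python version of the idea. It is only a sketch of the principle (keep phrases that occur in spam and never in ham); the real seek-phrases-in-corpus script works differently and far more carefully, and every name below is invented for the example.

```python
def ngrams(text, n):
    """Return the set of n-word phrases in a message body."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def spam_only_phrases(spam, ham, n=1):
    """Map each phrase seen in spam but never in ham to its spam %."""
    ham_grams = set()
    for msg in ham:
        ham_grams |= ngrams(msg, n)
    counts = {}
    for msg in spam:
        # Count each phrase once per message, skipping anything ham contains.
        for gram in ngrams(msg, n) - ham_grams:
            counts[gram] = counts.get(gram, 0) + 1
    return {gram: 100.0 * c / len(spam) for gram, c in counts.items()}

spam = ["We offer cheap pills", "We sell watches"]
italian_ham = ["Ciao, ci vediamo domani"]
mixed_ham = italian_ham + ["We meet tomorrow, Get some rest"]

italian_only = spam_only_phrases(spam, italian_ham)
with_english = spam_only_phrases(spam, mixed_ham)
print(sorted(italian_only.items()))
print(sorted(with_english.items()))
```

With only Italian ham, "We" looks like a perfect spam sign (it is in every spam and no ham); once a single English ham message is added it drops out, which is the effect of a ham corpus covering both languages.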

--j.


RE: [Rule Set proposal] French Rules

2008-06-19 Thread Giampaolo Tomassoni
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Thursday, June 19, 2008 5:49 PM
> To: Giampaolo Tomassoni
> Cc: [EMAIL PROTECTED]; users@spamassassin.apache.org
> Subject: Re: [Rule Set proposal] French Rules
> 
> ...omissis...
>

Ok, I see I have to get a copy of some reference mass-check: mine is mostly
in Italian and I'm getting a lot of stuff which could easily result in FPs.
See:

#  1.000   6.655   0.000
body SEEK_OKRP_V  /We/
#  1.000   4.292   0.000
body SEEK_ZHYXLF  / Redmond, WA /
#  1.000   4.292   0.000
body SEEK_EFMKIR  /Microsoft/
#  1.000   4.040   0.000
body SEEK_V__XNS  /Get/
#  1.000   3.841   0.000
body SEEK_EXHMOF  /This/

Thank you Justin,

Giampaolo



Re: [Rule Set proposal] French Rules

2008-06-19 Thread Justin Mason

Giampaolo Tomassoni writes:
> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, June 19, 2008 5:28 PM
> > To: Giampaolo Tomassoni
> > Cc: [EMAIL PROTECTED]; users@spamassassin.apache.org
> > Subject: Re: [Rule Set proposal] French Rules
> > 
> > 
> > Giampaolo Tomassoni writes:
> > > > -Original Message-
> > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > > > Sent: Wednesday, June 18, 2008 12:10 PM
> > > > To: John GALLET
> > > > Cc: users@spamassassin.apache.org
> > > > Subject: Re: [Rule Set proposal] French Rules
> > > >
> > > > ...omissis...
> > > >
> > > > by the way, if you're reasonably perl-capable, it might be worthwhile
> > > > using the algorithm I use to generate the JM_SOUGHT ruleset for English
> > > > spam: http://taint.org/tag/rule-discovery
> > > >
> > > > you just give it a corpus of spam samples and it generates the rules
> > > > for you.  The code is in SpamAssassin SVN.
> > > >
> > > > --j.
> > >
> > > Nah, that's great!
> > >
> > > I regret I can only occasionally read interesting messages due to my own
> > > time constraints. I could have read about this set of scripts weeks ago,
> > > otherwise...
> > >
> > > How is this code supposed to be used? I see these scripts in rule-dev:
> > > maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
> > > strip-high-scorers-from-log.
> > >
> > > Give us a brief description of their work and usage.
> > 
> > Basically, you collect 2 corpora:
> > 
> > 1. a big corpus of ham samples, stuff that you do not want to match.
> > 
> > 2. a smaller corpus of spam samples.
> > 
> > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> > the patterns; you can then write rules based on these.
> > 
> > Alternatively run "mass-check" and "seek-phrases-in-log" directly as that
> > script does, to get a bit more control (and generate real SpamAssassin
> > rules).  That's what the JM_SOUGHT scripts do.  See below:
> > 
> >   http://taint.org/x/2008/seekrules_run
> > 
> > that script also calls "mk_meta_rule", which is here:
> > http://taint.org/x/2008/mk_meta_rule
> 
> Running seek-phrases-in-corpus I get a lot of these:
> 
>   "Wide character in print at
> /home/whatever/masses/plugins/Dumptext.pm line 26."
> 
> Is it an issue with UTF-8 multibyte characters?

yes. It seems harmless -- I never got around to tracking it down.


RE: [Rule Set proposal] French Rules

2008-06-19 Thread Giampaolo Tomassoni
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Thursday, June 19, 2008 5:28 PM
> To: Giampaolo Tomassoni
> Cc: [EMAIL PROTECTED]; users@spamassassin.apache.org
> Subject: Re: [Rule Set proposal] French Rules
> 
> 
> Giampaolo Tomassoni writes:
> > > -Original Message-
> > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, June 18, 2008 12:10 PM
> > > To: John GALLET
> > > Cc: users@spamassassin.apache.org
> > > Subject: Re: [Rule Set proposal] French Rules
> > >
> > > ...omissis...
> > >
> > > by the way, if you're reasonably perl-capable, it might be worthwhile
> > > using the algorithm I use to generate the JM_SOUGHT ruleset for english
> > > spam: http://taint.org/tag/rule-discovery
> > >
> > > you just give it a corpus of spam samples and it generates the rules
> > > for you.  The code is in SpamAssassin SVN.
> > >
> > > --j.
> >
> > Nah, that's great!
> >
> > I regret I can only occasionally read interesting messages due to my own
> > time constraints. I could have read about this set of scripts weeks ago,
> > otherwise...
> >
> > How is this code supposed to be used? I see these scripts in rule-dev:
> > maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
> > strip-high-scorers-from-log.
> >
> > Give us a brief description of their work and usage.
> 
> Basically, you collect 2 corpora:
> 
> 1. a big corpus of ham samples, stuff that you do not want to match.
> 
> 2. a smaller corpus of spam samples.
> 
> You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> the patterns; you can then write rules based on these.
> 
> Alternatively run "mass-check" and "seek-phrases-in-log" directly as that
> script does, to get a bit more control (and generate real SpamAssassin
> rules).  That's what the JM_SOUGHT scripts do.  See below:
> 
>   http://taint.org/x/2008/seekrules_run
> 
> that script also calls "mk_meta_rule", which is here:
> http://taint.org/x/2008/mk_meta_rule

Running seek-phrases-in-corpus I get a lot of these:

"Wide character in print at
/home/whatever/masses/plugins/Dumptext.pm line 26."

Is it an issue with UTF-8 multibyte characters?

Giampaolo


> 
> --j.



Re: [Rule Set proposal] French Rules

2008-06-19 Thread Justin Mason

Giampaolo Tomassoni writes:
> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, June 18, 2008 12:10 PM
> > To: John GALLET
> > Cc: users@spamassassin.apache.org
> > Subject: Re: [Rule Set proposal] French Rules
> > 
> > ...omissis...
> >
> > by the way, if you're reasonably perl-capable, it might be worthwhile
> > using the algorithm I use to generate the JM_SOUGHT ruleset for english
> > spam: http://taint.org/tag/rule-discovery
> > 
> > you just give it a corpus of spam samples and it generates the rules
> > for
> > you.  The code is in SpamAssassin SVN.
> > 
> > --j.
> 
> Nah, that's great!
> 
> I regret I can only occasionally read interesting messages due to my own
> time constraints. I could have read about this set of scripts weeks ago,
> otherwise...
> 
> How is this code supposed to be used? I see these scripts in rule-dev:
> maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
> strip-high-scorers-from-log.
> 
> Give us a brief description of their work and usage.

Basically, you collect 2 corpora:

1. a big corpus of ham samples, stuff that you do not want to match.

2. a smaller corpus of spam samples.

You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
the patterns; you can then write rules based on these.
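
As a rough illustration of the idea only (this is not the actual
seek-phrases-in-corpus code, whose algorithm is not described in this
thread), the sketch below keeps word sequences that occur in several spam
samples and in no ham sample. The function names, the 3-word phrase length,
the hit threshold and the sample messages are all made up for the example.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the set of n-word phrases found in one message body."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seek_phrases(spam, ham, n=3, min_spam_hits=2):
    """Phrases seen in at least min_spam_hits spam messages and in no ham.

    Hypothetical helper for illustration; not part of SpamAssassin.
    """
    # Everything that appears in ham is off limits, however spammy it looks.
    ham_phrases = set()
    for body in ham:
        ham_phrases |= ngrams(body, n)

    # Count, per phrase, how many spam messages contain it.
    hits = Counter()
    for body in spam:
        hits.update(ngrams(body, n) - ham_phrases)

    return sorted(p for p, count in hits.items() if count >= min_spam_hits)


# Invented sample corpora: two spam bodies and one ham body.
spam = [
    "Payez moins d'impôts dès aujourd'hui, cliquez ici",
    "Offre spéciale: payez moins d'impôts grâce à nous",
]
ham = [
    "Votre billet Air France: pour en savoir plus, cliquez ici",
]

print(seek_phrases(spam, ham))
```

Each surviving phrase is only a candidate: it would still have to be turned
into a rule by hand and checked against a larger ham corpus.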

Alternatively run "mass-check" and "seek-phrases-in-log" directly as that
script does, to get a bit more control (and generate real SpamAssassin
rules).  That's what the JM_SOUGHT scripts do.  See below:

  http://taint.org/x/2008/seekrules_run

that script also calls "mk_meta_rule", which is here:
http://taint.org/x/2008/mk_meta_rule

--j.


RE: [Rule Set proposal] French Rules

2008-06-19 Thread Giampaolo Tomassoni
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 18, 2008 12:10 PM
> To: John GALLET
> Cc: users@spamassassin.apache.org
> Subject: Re: [Rule Set proposal] French Rules
> 
> ...omissis...
>
> by the way, if you're reasonably perl-capable, it might be worthwhile
> using the algorithm I use to generate the JM_SOUGHT ruleset for english
> spam: http://taint.org/tag/rule-discovery
> 
> you just give it a corpus of spam samples and it generates the rules
> for
> you.  The code is in SpamAssassin SVN.
> 
> --j.

Nah, that's great!

I regret I can only occasionally read interesting messages due to my own
time constraints. I could have read about this set of scripts weeks ago,
otherwise...

How is this code supposed to be used? I see these scripts in rule-dev:
maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
strip-high-scorers-from-log.

Give us a brief description of their work and usage.

Nice idea, Justin!

Giampaolo



Re: [Rule Set proposal] French Rules

2008-06-19 Thread John GALLET


I am still missing samples for two rules: even though I did have hits
according to /var/spool/maillog, I did not save them.


I added a sample for the FR_NOTSPAM rule, and I removed the 
FR_YOURELUCKY rule, as I see other forms of the text getting through, so 
it is not effective. On the other hand, nearly all these messages are 
caught by RBL rules, so I might even remove it completely if I can't find 
an effective one.


John
PS: reminder, rules and samples available at
http://www.saphirtech.fr/spam/



Re: [Rule Set proposal] French Rules

2008-06-18 Thread Justin Mason

John GALLET writes:
> Hi,
> 
> This is my first post on this list and first ruleset, so please point me 
> to the right place/documents if I am doing anything wrong.
> 
> According to a search of this list on markmail.org, there have been few 
> subjects about spam in French and (no disrespect meant) I would agree with 
> the comments I read about the current French Ruleset being inadequate 
> (tried it, did not keep any of it).
> 
> So I would like to propose a set for French Rules and get your feedback.

by the way, if you're reasonably perl-capable, it might be worthwhile
using the algorithm I use to generate the JM_SOUGHT ruleset for english
spam: http://taint.org/tag/rule-discovery

you just give it a corpus of spam samples and it generates the rules for
you.  The code is in SpamAssassin SVN.

--j.


Re: [Rule Set proposal] French Rules

2008-06-17 Thread John GALLET

Hi,


I was able to access the URL you mentioned, but not all of the files
below it.  I received:
"Forbidden
You don't have permission to access /spam/FR_PAYLESSTAXES.txt on this server."


Sorry guys, only the ruleset file (the one I tested, of course) was 
readable; all the non-empty spam samples had the wrong permissions. This is fixed.


I am still missing samples for two rules: even though I did have hits
according to /var/spool/maillog, I did not save them.


John




Re: [Rule Set proposal] French Rules

2008-06-17 Thread Big Wave Dave
On Tue, Jun 17, 2008 at 12:11 PM, John GALLET
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> This is my first post on this list and first ruleset, so please point me to
> the right place/documents if I am doing anything wrong.
>
> According to a search of this list on markmail.org, there have been few
> subjects about spam in French and (no disrespect meant) I would agree with
> the comments I read about the current French Ruleset being inadequate (tried
> it, did not keep any of it).
>
> So I would like to propose a set for French Rules and get your feedback.
>
> You can find both the rules and some sample spam email messages (two of them
> missing, I have hits in my log files, but deleted them) at the following
> URL: http://www.saphirtech.fr/spam/
>
> I have been running these for about a month site-wide on three domains, and I
> have not seen any false positives (yet).
>
> Sincerely,
> JG

I was able to access the URL you mentioned, but not all of the files
below it.  I received:
"Forbidden
You don't have permission to access /spam/FR_PAYLESSTAXES.txt on this server."


Dave