Giampaolo Tomassoni writes:
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, June 19, 2008 5:28 PM
> > To: Giampaolo Tomassoni
> > Cc: [EMAIL PROTECTED]; users@spamassassin.apache.org
> > Subject: Re: [Rule Set proposal] French Rules
> >
> >
> > Giampaolo Tomassoni writes:
> > > > -----Original Message-----
> > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > > > Sent: Wednesday, June 18, 2008 12:10 PM
> > > > To: John GALLET
> > > > Cc: users@spamassassin.apache.org
> > > > Subject: Re: [Rule Set proposal] French Rules
> > > >
> > > > ...omissis...
> > > >
> > > > by the way, if you're reasonably perl-capable, it might be worthwhile
> > > > using the algorithm I use to generate the JM_SOUGHT ruleset for english
> > > > spam: http://taint.org/tag/rule-discovery
> > > >
> > > > you just give it a corpus of spam samples and it generates the rules for
> > > > you. The code is in SpamAssassin SVN.
> > > >
> > > > --j.
> > >
> > > Nah, that's great!
> > >
> > > I regret I can only occasionally read interesting messages due to my own
> > > time constraints. I could have read about this set of scripts weeks ago,
> > > otherwise...
> > >
> > > How this code is supposed to be used? I see these scripts in rule-dev:
> > > maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
> > > strip-high-scorers-from-log.
> > >
> > > Give us a brief description of their work and usage.
> >
> > Basically, you collect 2 corpora:
> >
> > 1. a big corpus of ham samples, stuff that you do not want to match.
> >
> > 2. a smaller corpus of spam samples.
> >
> > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> > the patterns; you can then write rules based on these.
> >
> > Alternatively run "mass-check" and "seek-phrases-in-log" directly as that
> > script does, to get a bit more control (and generate real SpamAssassin
> > rules). That's what the JM_SOUGHT scripts do. See below:
> >
> > http://taint.org/x/2008/seekrules_run
> >
> > that script also calls "mk_meta_rule", which is here:
> > http://taint.org/x/2008/mk_meta_rule
>
> Running seek-phrases-in-corpus I get a lot of these:
>
> "Wide character in print at
> /home/whatever/masses/plugins/Dumptext.pm line 26."
>
> Is it an issue with UTF-8 multibyte characters?

yes. It seems harmless -- I never got around to tracking it down.
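
For anyone who wants a feel for the two-corpus idea quoted above before
digging into the real scripts, here is a rough Python sketch. It is only an
illustration of the general principle (keep phrases that recur in the spam
corpus and never appear in the ham corpus); it is not the algorithm that
seek-phrases-in-corpus implements, and the names ngrams, candidate_phrases,
n and min_spam_hits are made up for this example.

```python
from collections import Counter


def ngrams(text, n):
    """Yield every run of n consecutive whitespace-separated words."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def candidate_phrases(spam_msgs, ham_msgs, n=3, min_spam_hits=2):
    """Return (phrase, hits) pairs for n-word phrases that occur in at
    least min_spam_hits spam messages and in no ham message at all.

    Hypothetical helper for illustration; not part of SpamAssassin.
    """
    # Every phrase seen anywhere in ham is disqualified outright.
    ham_seen = set()
    for msg in ham_msgs:
        ham_seen.update(ngrams(msg, n))

    # Count how many distinct spam messages contain each phrase.
    spam_hits = Counter()
    for msg in spam_msgs:
        # set() so a phrase repeated inside one message counts once
        spam_hits.update(set(ngrams(msg, n)))

    return [
        (phrase, hits)
        for phrase, hits in spam_hits.most_common()
        if hits >= min_spam_hits and phrase not in ham_seen
    ]


if __name__ == "__main__":
    spam = [
        "achetez vos pilules pas cher ici",
        "offre speciale achetez vos pilules maintenant",
    ]
    ham = ["bonjour voici le compte rendu de la reunion"]
    for phrase, hits in candidate_phrases(spam, ham):
        print(hits, phrase)
```

The surviving phrases are only candidates: as the quoted message says, you
would still write (or generate) the actual rules from them, and a big ham
corpus is what keeps the false positives down.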