Giampaolo Tomassoni writes:
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, June 19, 2008 5:28 PM
> > To: Giampaolo Tomassoni
> > Cc: [EMAIL PROTECTED]; users@spamassassin.apache.org
> > Subject: Re: [Rule Set proposal] French Rules
> > 
> > 
> > Giampaolo Tomassoni writes:
> > > > -----Original Message-----
> > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > > > Sent: Wednesday, June 18, 2008 12:10 PM
> > > > To: John GALLET
> > > > Cc: users@spamassassin.apache.org
> > > > Subject: Re: [Rule Set proposal] French Rules
> > > >
> > > > ...omissis...
> > > >
> > > > by the way, if you're reasonably perl-capable, it might be worthwhile
> > > > using the algorithm I use to generate the JM_SOUGHT ruleset for english
> > > > spam: http://taint.org/tag/rule-discovery
> > > >
> > > > you just give it a corpus of spam samples and it generates the rules
> > > > for you.  The code is in SpamAssassin SVN.
> > > >
> > > > --j.
> > >
> > > Nah, that's great!
> > >
> > > I regret I can only occasionally read interesting messages due to my own
> > > time constraints. I could have read about this set of scripts weeks ago,
> > > otherwise...
> > >
> > > How is this code supposed to be used? I see these scripts in rule-dev:
> > > maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
> > > strip-high-scorers-from-log.
> > >
> > > Give us a brief description of their work and usage.
> > 
> > Basically, you collect 2 corpora:
> > 
> > 1. a big corpus of ham samples, stuff that you do not want to match.
> > 
> > 2. a smaller corpus of spam samples.
> > 
> > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> > the patterns; you can then write rules based on these.
> > 
> > Alternatively run "mass-check" and "seek-phrases-in-log" directly as that
> > script does, to get a bit more control (and generate real SpamAssassin
> > rules).  That's what the JM_SOUGHT scripts do.  See below:
> > 
> >   http://taint.org/x/2008/seekrules_run
> > 
> > that script also calls "mk_meta_rule", which is here:
> > http://taint.org/x/2008/mk_meta_rule
> 
> Running seek-phrases-in-corpus I get a lot of these:
> 
>       "Wide character in print at
> /home/whatever/masses/plugins/Dumptext.pm line 26."
> 
> Is it an issue with UTF-8 multibyte characters?

Yes. It seems harmless -- I never got around to tracking it down.
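
For anyone following along, the two-corpus idea quoted above (phrases that turn up in spam but never in ham) can be sketched roughly as below. This is only an illustration in Python, not the actual seek-phrases-in-corpus code, which is Perl and does considerably more. The names word_ngrams and seek_phrases, the fixed phrase length n, and the min_spam_hits threshold are all made up for this sketch.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the set of n-word phrases in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seek_phrases(spam_msgs, ham_msgs, n=3, min_spam_hits=2):
    """Hypothetical helper: phrases found in at least min_spam_hits spam
    messages and in no ham message at all."""
    ham_phrases = set()
    for msg in ham_msgs:
        ham_phrases |= word_ngrams(msg, n)

    spam_hits = Counter()
    for msg in spam_msgs:
        # word_ngrams returns a set, so each message counts a phrase once
        spam_hits.update(word_ngrams(msg, n))

    candidates = [
        (phrase, hits)
        for phrase, hits in spam_hits.items()
        if hits >= min_spam_hits and phrase not in ham_phrases
    ]
    # Most frequent first, ties broken alphabetically for stable output
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy corpora, invented for the example
    spam = [
        "Achetez vos montres de luxe pas cher maintenant",
        "Offre speciale: montres de luxe pas cher ici",
    ]
    ham = ["Bonjour, la reunion de demain est reportee a jeudi"]
    for phrase, hits in seek_phrases(spam, ham):
        print(hits, phrase)
```

The surviving phrases are only candidates; you would still review them and turn them into rules by hand, or let the scripts linked above do that part.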
