Theo Van Dinter writes:
> On Wed, Sep 06, 2006 at 07:07:42PM +0100, Justin Mason wrote:
> > the problem is that it needs to read the rules from rulesrc/sandbox/* --
> > and those rules are pretty dependent in places on the rules in
> > rulesrc/core.  Those rules, in turn, are the 3.2.0 core ruleset, which
> > doesn't mix well with (ie stomps all over) the 3.1.x core ruleset.
> > 
> > We could come up with a way to use the 3.2.0 core ruleset in place of
> > the 3.1.x one -- but I think the effort required would be too much, esp.
> > since it's easier to just concentrate on the 3.2.0 release instead.
> 
> I'm not sure I agree with this, and we *need* to solve this problem
> going forward, or else we won't be able to do 3.2 updates when we're
> working on 3.3.
> 
> The rules are pretty version agnostic, except for the ones which have a
> dependency on a plugin or other code change that 3.1 doesn't have.  I think
> it'd be pretty easy to do a run with the 3.2 code and run with the 3.1 code
> and figure out which those are.
> 
> Rules that don't work the same get an "if version" wrapper, the rest can stay
> the way they are.  We can also look at backporting the differences as
> appropriate.

OK -- agreed.  However, my point is that we'd be better off doing that
work as part of the 3.2.0 development, and later for 3.3.0, rather
than trying to "retrofit" it into 3.1.6 or 3.1.7.
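
As a rough illustration of the comparison Theo describes (scan the same
messages with the 3.1 code and the 3.2 code, then see which rules behave
differently), here is a minimal sketch.  The input shape (message id mapped
to the set of rule names that hit) and the rule names are assumptions made
for the example; this is not the mass-check log format.

```python
# Sketch only: the input shape below is an assumption, not mass-check's log format.
from typing import Dict, Set

Hits = Dict[str, Set[str]]  # message id -> names of rules that hit


def rules_that_differ(hits_31: Hits, hits_32: Hits) -> Set[str]:
    """Return rules whose hits differ on any message scanned by both versions."""
    differing: Set[str] = set()
    for msg_id in hits_31.keys() & hits_32.keys():
        # Symmetric difference: rules that hit under one version but not the other.
        differing |= hits_31[msg_id] ^ hits_32[msg_id]
    return differing


# Hypothetical results for two messages; the rule names are made up.
hits_31 = {"msg1": {"RULE_A", "RULE_B"}, "msg2": {"RULE_A"}}
hits_32 = {"msg1": {"RULE_A", "RULE_B", "RULE_C"}, "msg2": {"RULE_A"}}

print(sorted(rules_that_differ(hits_31, hits_32)))
```

Rules that show up in the result would be the candidates for an "if version"
wrapper or a backport; everything else could stay as it is.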

There's no *need* to keep 3.1.x going, if we can start getting 3.2.0
released instead.

> As for rulesrc, mkrules, etc -- 3.1 doesn't need any of that.  This is also my
> main issue with how 3.2 currently does stuff.  I don't understand why this
> stuff is part of the normal distro.  I like to think of the distro as the
> engine side of the project, and mkrules/rulesrc as the rules side of the
> project, and there's no reason they have to be together.
> 
> So for 3.1, we generate, externally, the rules directory and include it in the
> directory that gets mass-check'ed.  For 3.2, same thing.  Then in the normal
> SA distribution, we don't need the whole svn:external/rulesrc/mkrules/etc
> stuff, it'll just be a rules dir like before.

Hmm -- I think I'd need more details of how that'd work -- I'm not
sure I get it.

One thing I'd want to avoid is having to set up two separate SVN
workspaces to get a usable checkout, or having to download two separate
tarballs to get a usable release.  In my opinion, the core code
is nearly useless without rules, so there isn't a need to ship it
without them.

> > last week featuring data from 7 contributors.
> 
> Hrm.  It still seems like a small number of messages/diversity:
> 
>        6  0.0 ham-bb-doc.log
>    14998 18.9 ham-bb-jm.log
>        6  0.0 ham-bb-zmi.log
>     6357  8.0 ham-cthielen.log
>     1510  1.9 ham-daf.log
>      167  0.2 ham-dos.log
>     1958  2.5 ham-parkerm.log
>    46895 59.0 ham-theo.log
>     2028  2.5 ham-wtogami.log
>     5619  7.1 ham-zmi.log
> 
>     15006  3.9 spam-bb-doc.log
>     15000  3.9 spam-bb-jm.log
>      8358  2.2 spam-bb-zmi.log
>     13783  3.5 spam-cthielen.log
>      6261  1.6 spam-daf.log
>      4676  1.2 spam-dos.log
>     61619 15.9 spam-parkerm.log
>    253448 65.2 spam-theo.log
>      2156  0.6 spam-wtogami.log
>      8359  2.2 spam-zmi.log
> 
> (that's 468210 total, btw)   and why does zmi have two sets of files?

It looks like his spam collections are a dup, alright.
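
For reference, the percentages and the total in the quoted table can be
re-derived from the raw counts, and the same counts make the suspected
duplicate easy to spot.  A small sketch follows; the function names and the
one-message tolerance are choices made for this example, and matching counts
are only a hint of duplication, not proof.

```python
# Counts copied from the quoted table (log name without prefix -> messages).
ham = {"bb-doc": 6, "bb-jm": 14998, "bb-zmi": 6, "cthielen": 6357,
       "daf": 1510, "dos": 167, "parkerm": 1958, "theo": 46895,
       "wtogami": 2028, "zmi": 5619}
spam = {"bb-doc": 15006, "bb-jm": 15000, "bb-zmi": 8358, "cthielen": 13783,
        "daf": 6261, "dos": 4676, "parkerm": 61619, "theo": 253448,
        "wtogami": 2156, "zmi": 8359}


def shares(counts):
    """Percentage of the total contributed by each log, to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


def near_duplicates(counts, tolerance=1):
    """Pairs of logs whose message counts differ by at most `tolerance`."""
    names = sorted(counts)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(counts[a] - counts[b]) <= tolerance]


print(sum(ham.values()) + sum(spam.values()))   # grand total of all logs
print(shares(ham)["theo"], shares(spam)["theo"])
print(near_duplicates(spam))
```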

For what it's worth, the "bb-*" mass-checks are limited to at most 15k
messages of each type, since mass-checking old spam is pointless.  I should
set it up to allow more ham, however, since old ham is fine.
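
A per-type cap like this could be sketched as follows.  The thread only says
the limit is 15k messages per type; keeping the *newest* messages, the
(path, timestamp) tuple shape, and the names below are assumptions made for
the illustration, not how the "bb-*" mass-checks are actually implemented.

```python
from typing import List, Optional, Tuple

Message = Tuple[str, int]  # (path, received time as a Unix timestamp); assumed shape

SPAM_LIMIT: Optional[int] = 15000  # old spam is not worth mass-checking
HAM_LIMIT: Optional[int] = None    # None means "no cap", since old ham is fine


def cap_newest(messages: List[Message], limit: Optional[int]) -> List[Message]:
    """Return messages newest first, truncated to `limit` if one is set."""
    newest_first = sorted(messages, key=lambda m: m[1], reverse=True)
    return newest_first if limit is None else newest_first[:limit]


# Hypothetical corpus of three messages.
corpus = [("a", 1), ("b", 3), ("c", 2)]
print(cap_newest(corpus, 2))     # the two newest messages
print(cap_newest(corpus, None))  # no cap: everything, newest first
```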

I was also thinking we should set up some trusted spamtraps to collect
lots of spam with "live" network test data -- I think most of the spam
corpora we're mass-checking nowadays are incomplete.  For example, my
corpus omits everything that hit SBL+XBL, and Michael's similarly
omits lots of those too.  Nowadays spamtrapping may be the only viable
way to get a really representative spam corpus...

--j.
