On Wed, Jul 20, 2005 at 03:13:29PM -0700, Loren Wilton wrote:
> A big problem though is that most of our test corpuses are english/american,
> so we never really know what will happen in Germany, Sweden, or Nigeria.
> (Not sure we care about the last. :-)  We REALLY need to somehow get to the
> point that we have test corpori from various other parts of the world.

corpora?  http://wiki.apache.org/spamassassin/PluralOfCorpus ;-)

Unfortunately, getting more corpora means getting more people to volunteer
and get involved.  And that's one of the issues we're discussing here! :-)

I think we would *love* it if we could get more people volunteering their
corpora (or more specifically their system resources -- we don't actually
want your mail) to help with rule development.

I've updated the Wiki so that
http://wiki.apache.org/spamassassin/NightlyMassCheck is more up to date and
it should be easier for people to figure out how it works.

> > Detection of redundancy or linear independency.
> > Is my new rule covered or disabled by another rule
> > or does it affect existing rules?
> > This could be detected with a MassCheck.
>
> Masscheck has an interdependency option, although it increases the checking
> time.  We use it on rules once they seem useful, but not usually in early
> one-off checking.

I'm not sure what you mean by this.  We have an "overlap" script which does
some of this -- is that what you're talking about?

> This is a very interesting idea that I think needs more exploring in the
> future.  Any SA server that has a Bayes database potentially has most of the
> knowledge to be able to participate in Seti-like background processing for
> determining rule hit ratios.  For that matter, any SA server should be able
> to collect logs of the rules that are hitting there, and send out that rule
> hit information to some central server once a day.  This won't necessarily
> give fp/fn hit counts, but it can give total hits per rule, and that is
> moderately valuable information in itself, while still being pretty
> anonymous.

Interesting, I agree.  I'm not sure this will help at all with new rule
development, but it would give us interesting data on relative hit rates
over time.  It would certainly be lots of work to set up, though.
:-(

> Sare has a rule scoring method that Bob developed that assigns a probable
> score to the rule based on the masscheck results.  Sometimes we modify this
> manually based on other factors, but most of the time it goes into the rules
> files directly.  We know it isn't as good as a full SA scoring run.  But on
> the other hand, it doesn't require a full SA scoring run, and generally
> produces pretty usable results.  I would envision something like this being
> used for initial rule introductions, and periodically the rules would be
> rescored using a full scoring run.

Even better would be to be able to do a full scoring run every night, or
every week or something like that, but this would be very difficult to
achieve.

Perhaps we can look at the results after the 3.1 run and see if there are
any relationships we can use between rule hit rates and score.  I fear that
there's too much interdependency though for this to be possible.

--
Duncan Findlay
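
As a rough illustration of the question raised above (is a new rule already
covered by an existing one?), here is a minimal Python sketch.  It is not the
"overlap" script mentioned in this thread and does not read real mass-check
logs.  The input format (a mapping from message id to the set of rule names
that hit) and the names rule_overlap, sample, NEW_RULE, OLD_RULE and
OTHER_RULE are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def rule_overlap(hits_by_message):
    """Return {(a, b): fraction of a's hits on which b also hit}.

    hits_by_message is a hypothetical structure: message id -> iterable
    of rule names that hit that message.
    """
    totals = defaultdict(int)  # rule -> number of messages it hit
    both = defaultdict(int)    # sorted (a, b) pair -> messages hit by both
    for rules in hits_by_message.values():
        rules = set(rules)
        for rule in rules:
            totals[rule] += 1
        for a, b in combinations(sorted(rules), 2):
            both[(a, b)] += 1

    overlap = {}
    for (a, b), n in both.items():
        overlap[(a, b)] = n / totals[a]  # share of a's hits also hit by b
        overlap[(b, a)] = n / totals[b]  # share of b's hits also hit by a
    return overlap


# Made-up data, for illustration only.
sample = {
    "msg1": {"NEW_RULE", "OLD_RULE"},
    "msg2": {"NEW_RULE", "OLD_RULE"},
    "msg3": {"OLD_RULE"},
    "msg4": {"OTHER_RULE"},
}
result = rule_overlap(sample)
```

A value close to 1.0 for (NEW_RULE, OLD_RULE) would suggest that the new rule
rarely fires without the old one, i.e. it may be redundant.  Pairs that never
hit the same message are simply absent from the result.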
