Hello Chris, all,

Wednesday, July 20, 2005, 8:37:20 AM, you wrote:
>> So I think the one-off is more useful for developing a ruleset
>> while the daily full MC is more useful for integration testing.
>> Personally, I participate in both. With my small corpus it's no
>> biggie either way.

And with my corpus, where 2 months of email comes to 250k messages, a
single-rule mass-check takes 8 hours. A full system mass-check (that
one rule plus all distribution rules) can easily take 16 hours,
especially if that one rule is subject to performance problems.

CS> one-off corpus checks are a serious backbone to SARE's success. IT
CS> allows us to see which direction the rules should take, what won't
CS> work, and the rules are often coded differently the first 2 "one
CS> off" checks, then changed for a final.

CS> Its not uncommon to see a rule or ruleset checked 3-5 times before
CS> release. Which often takes less then a day.

Indeed, it's not uncommon for a rule or ruleset to be checked 2-3
times with knowingly excessive regexes, so we can see what actually is
or isn't being matched by the various regex alternatives. We use this
information to improve the rule, and then remove the excess from the
regex for a final pre-publication run.

CS> Perhaps some thing like the dev "bug squish events" could be used?
CS> Once a week the people who run SARE rule sets check to see the
CS> biggest hitters, and on that day we test those heavy hitters
CS> against a bigger corpus, and look to add to SA. Successful ones
CS> get moved out of SARE and into SA.

That would work for me.

I'm not suggesting that all of these options be available -- just
throwing out ideas:

a) a mailing list for quick turn-around checks. I can probably create
   a system here with a special corpus that contains randomly selected
   5k ham and 5k spam from the full 2 month corpus, and turn most of
   those mass-checks around in minutes. Chris T and Fred T already
   have systems running that turn around mass-checks in minutes. This
   is a technique we can probably share (if needed -- ours isn't the
   only method).
b) nightly mass-checks run specifically on rules submitted for this
   purpose. This would include rules/70_testing.cf, but not the
   entirety of rules/*.cf. This would enable reasonably quick
   mass-checks on all sorts of rules.
c) weekly mass-checks on the full svn trunk rules/*.cf, plus anything
   submitted to the weekly mass-check (which could be everything
   submitted to the nightly mass-check except those that are
   specifically pulled). These will run longer, but give better
   statistics for comparing rules against each other.
d) monthly mass-checks of ", with network tests enabled.
e) special network mass-checks of rules that require network access,
   as needed.
f) monthly rescoring mass-check for rules that are worthwhile,
   followed by a perceptron run to rescore the rules, and an sa-update
   to distribute them.

Bob Menschel
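
P.S. For option (a), here is a rough sketch of how a fixed-size random
sample might be drawn from a larger corpus. It is only an illustration:
the function name sample_corpus and the idea of working from a plain
list of message paths are assumptions of this sketch, not a description
of how any existing quick-check system is actually set up.

```python
import random


def sample_corpus(paths, k, seed=None):
    """Pick k message paths at random, without replacement.

    `paths` is any iterable of message file paths (hypothetical input;
    how the list is gathered is up to the corpus owner). If the corpus
    holds k messages or fewer, all of them are returned.
    """
    rng = random.Random(seed)   # a fixed seed makes the sample repeatable
    ordered = sorted(paths)     # stable order, so seed -> same sample
    if len(ordered) <= k:
        return ordered
    return rng.sample(ordered, k)
```

Calling it once on the ham list and once on the spam list with k=5000
would give the 5k/5k corpus described in (a). Keeping the seed fixed
means successive one-off checks run against the same messages, so
results for different versions of a rule stay comparable.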
