So, this is coming along nicely. ;)
STORY SO FAR:

http://spamassassin.zones.apache.org:8011/ is the main UI -- each time a checkin occurs in SVN, a set of mass-checks is triggered. There are 4 mass-checks at the moment: mc-fast, mc-med, mc-slow and mc-slower.

The idea is that the "fast" one completes first, with only a few thousand messages (right now, it mass-checks 5700 mails in 3 minutes!), providing a quick, rough look at freqs:

  http://spamassassin.zones.apache.org:8011/mc-fast/builds/28/configure_2/0
  (Mass-check results from "mc-fast")

Then, gradually, the other 3 complete and provide their results as well:

  http://spamassassin.zones.apache.org:8011/mc-slower/builds/21/configure_2/0
  (Mass-check results from "mc-slower")

The results page presents a basic look at the "freqs" output. At the same time, it starts generating the data for the next step, the rule-QA app:

  http://buildbot.spamassassin.org/ruleqa/ruleqa?daterev=20051025/r328495
  (the rule-QA view of the same data)

This allows us to "drill down" to more details about a rule:

  http://buildbot.spamassassin.org/ruleqa/ruleqa?daterev=20051025%2Fr328495&rule=T_SUBJ_RE_NUM&s_detail=1
  (drilled-down for details about "T_SUBJ_RE_NUM")

Now, I have a couple of things remaining on the todo list for this app:

- message hits-over-time graphs
- hitrates on messages by score (does this rule hit high-scoring spams only?)

So they're in the pipeline.

HELP NEEDED!

In addition, we need another thing: mail! There are a few issues that we have to worry about, though:

- the privacy of submitted ham: in other words, I think most of us might have a hard time uploading our freshest, unchecked ham mail, since there could be private stuff in there.
- the freshness of submitted spam: old spam is only partially useful, and in fact can be misleading (i.e. a rule can fire well on it but be useless against current and future spam).
- the hand-filteredness.

We need fresh spam and private ham. Ham doesn't need to be quite as up-to-the-minute-fresh, but spam does.
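As a rough illustration of the freshness point, here is a minimal Python sketch of how a contributor might pick out recent messages before uploading. It is not part of the mass-check tooling: the function name `select_fresh`, the use of file modification time as a stand-in for message age, and the example cutoffs are all assumptions made for illustration.

```python
import os
import time

def select_fresh(paths, max_age_days, limit, now=None):
    """Return up to `limit` paths modified within `max_age_days`, newest first.

    Hypothetical helper: file mtime is only a proxy for when a message
    actually arrived.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    dated = [(os.path.getmtime(p), p) for p in paths]
    fresh = [(mtime, p) for mtime, p in dated if mtime >= cutoff]
    fresh.sort(reverse=True)  # newest first
    return [p for _, p in fresh[:limit]]
```

With a helper like this, spam could be selected with a short window (say, a few weeks) and ham with a much longer one, since ham does not go stale as quickly. The exact windows are a judgment call and are not specified here.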
So, next question -- can you provide a corpus which you're prepared to update frequently?

What I'm thinking is, up to about 20k ham/20k spam messages from a few people should be plenty. (This is only the "preflight" mass-check, for quick checking; it doesn't have to be comprehensive. Anything from a few thousand up would be perfect.)

It's important that the ham be pristine, and that the spam be reasonably pristine; spam needs to be up-to-date, ham not so much. I think the easiest way to transfer it is via rsync.

--j.
