Re: Re[6]: Hackathon summary

Justin Mason Tue, 26 Jul 2005 11:25:24 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I've considered this idea in the past.  Here are my thoughts:

First off, it *does* need human verification. Assuming that unverified
SpamAssassin results are reliable at a given level is quite dangerous --
see the many, *many* reports of Bayes auto-learning going haywire.  Sites
seem to get very different levels of FPs and FNs, so assuming that the
levels seen at one site will apply at others is not reliable.  Unverified
results are worse than no results, because they carry an appearance of
reliability that is misleading.

Secondly, we already have an API for this: reporting, as in
"spamassassin -r".  Reporting is equivalent to "learning", except that
third-party-operated remote systems also get a copy of the data; it's even
equivalent in that there's a way to report-as-ham as well as
report-as-spam (although there may not be a command line API for that
yet). This API is the natural way to do this.

Agreed btw that the X-Spam-Status line is all that needs to be reported;
preferably munged into mass-check format (similar to how spamd does it).
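
As a rough illustration of that munging, here is a minimal Python sketch.
It assumes the usual "Yes/No, score=N required=N tests=A,B,C ..." header
layout (older releases wrote "hits=" instead of "score=").  The function
names and the message-id argument are made up for the example, and the
output line only approximates a mass-check log line; the real format
carries more fields.

```python
import re

RULE = r"[A-Za-z0-9_]+"

def parse_spam_status(header_value):
    """Parse the value of an X-Spam-Status header into a dict."""
    # Unfold the header: folded continuation lines become single spaces.
    flat = re.sub(r"\s+", " ", header_value).strip()
    verdict = re.match(r"(Yes|No)\b", flat)
    score = re.search(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)", flat)
    tests = re.search(r"\btests=(" + RULE + r"(?:, ?" + RULE + r")*)", flat)
    if not verdict or not score:
        raise ValueError("unrecognized X-Spam-Status value")
    rules = re.split(r", ?", tests.group(1)) if tests else []
    # "tests=none" is treated as "no rules hit".
    if rules == ["none"]:
        rules = []
    return {
        "is_spam": verdict.group(1) == "Yes",
        "score": float(score.group(1)),
        "rules": rules,
    }

def to_masscheck_line(parsed, msg_id):
    """Render a parsed header as a mass-check-like line (approximate layout)."""
    flag = "Y" if parsed["is_spam"] else "."
    return f"{flag} {parsed['score']} {msg_id} {','.join(parsed['rules'])}"
```

Note that the "Yes"/"No" here is only SpamAssassin's own verdict; per the
first point above, it would still need a human-verified label before being
reported.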

A plugin would already have the hooks to do this.  Patches would
be cool ;)

- --j.

Robert Menschel writes:
> Hello Loren,
> 
> Tuesday, July 26, 2005, 1:43:42 AM, you wrote:
> 
> >> More thought ... what if SA systems were to accumulate daily
> >> statistics, along the lines of one record for each rule, containing:
> 
> LW> That sounds like the general sort of vague idea I had, fleshed out in
> LW> more detail.
> LW> Certainly the desirable goal is basically:
> 
> LW> 1 does this rule hit anything?
> LW> 2 does it hit what it was supposed to hit?
> LW> 3 does it look like a score adjustment might help, either up or down?
> LW> 4 is this hitting something in a language that it wasn't intended to hit?
> 
> LW> I think to do that we need basically anonymous information,
> LW> with the exception that we should know the primary site
> LW> language(s) to help diagnose foreign language problems.
> 
> Rather than the primary site language(s), I'd be more interested in
> the email's language.  ContractorsWarehouse.com understands only
> English, but twice in the last five years we've received non-English
> ham from individuals in Europe who hoped (or maybe assumed) we
> would understand their language.
> 
> "Email in Polish, spam" is more useful I think than "Home language
> French, spam".
> 
> LW> In addition, I think the site should be able to optionally
> LW> report a site contact address if they want to.  This could be
> LW> useful if the stats indicate that they have a seemingly local rule
> LW> that is doing really well.  There would be someone that we could
> LW> write and ask if they would be willing to contribute it to the
> LW> regular rules.
> 
> Good idea!
> 
> LW> Another thing that would be nice to get from sites would be
> LW> rule overlap information.  I'm not sure how to accumulate this
> LW> with any efficiency, nor how to report it compactly.  But with a
> LW> good idea of rules hitting in the spam/ham categories, and a
> LW> decent indication of rule overlap, it should be possible to
> LW> generate theoretical scoring profiles that would work perhaps
> LW> better than the default.
> 
> The way to get that would be to receive mass-check-like info, perhaps
> a log of every email scanned, just the default (unmodified!)
> X-Spam-Status line. These can then be examined for overlaps, freqs,
> etc.
> 
> Bob Menschel
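
The overlap and frequency examination Bob describes could be sketched
roughly as follows.  This is only an illustration in Python: it assumes
each scanned message has already been reduced to a (verified spam flag,
list of rule names) pair, and the function name is invented for the
example.

```python
from collections import Counter
from itertools import combinations

def rule_stats(records):
    """Count per-rule hits and pairwise rule overlaps.

    records: iterable of (is_spam, rules) pairs, where is_spam is a
    human-verified bool and rules is a list of rule names that hit.
    """
    freq = {True: Counter(), False: Counter()}
    overlap = Counter()
    for is_spam, rules in records:
        # De-duplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(rules))
        freq[is_spam].update(unique)
        overlap.update(combinations(unique, 2))
    return freq, overlap
```

freq[True] and freq[False] then give spam and ham hit counts per rule,
and overlap counts how often two rules fired on the same message.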
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFC5n+SMJF5cimLx9ARAgr1AJ40HC8qPbR19kKzW9/l/f9pAWsFfQCfbXIV
+40sGCPuNto0gqVvbPJrS6Q=
=i3HY
-----END PGP SIGNATURE-----
