Michael Bell said:

> Kinda hard to say. Most of it IS spammy and valid MIME as far as I
> could tell. I did catch a few clearly-non-spam (evite) things in the
> corpus. 
> 
> The lack of Received lines does mess up quite a few DNS related tests
> (RBL, MX records) but I wouldn't think that alone made a 23%
> difference (83% on Justin's sample) in success. Remember - I ran SA
> 2.43 in both cases with -L so most of the stuff relying on that isn't
> relevant.

I haven't looked yet, but

  (a) if they're not well cleaned (i.e. if there is valid nonspam in there),
  it's going to seriously impact the archive's usefulness.

  (b) on the other issue: a lot of SpamAssassin's top tests use header
  info, even in the -L case, so with the headers removed, a 20% accuracy
  drop would be about right.
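
A quick way to see how much of a corpus is affected is to count the
messages that arrive with no Received lines at all. The following is
only a rough sketch using Python's standard email module; the function
names and the sample messages are made up for illustration, and it says
nothing about which tests SpamAssassin actually runs.

```python
from email import message_from_string


def received_count(raw_message):
    """Return the number of Received header lines in one raw message."""
    msg = message_from_string(raw_message)
    # get_all returns the fallback (an empty list) when the header is absent.
    return len(msg.get_all("Received", []))


def stripped_fraction(raw_messages):
    """Return the fraction of messages that carry no Received lines."""
    if not raw_messages:
        return 0.0
    stripped = sum(1 for raw in raw_messages if received_count(raw) == 0)
    return stripped / len(raw_messages)


# Hypothetical sample data: one message with its relay trail, one without.
WITH_HEADERS = (
    "Received: from relay.example.org by mx.example.com; "
    "Mon, 2 Sep 2002 10:00:00 +0000\n"
    "From: someone@example.org\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
WITHOUT_HEADERS = (
    "From: someone@example.org\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)

if __name__ == "__main__":
    print(stripped_fraction([WITH_HEADERS, WITHOUT_HEADERS]))
```

If most of the archive comes back with a count of zero, any test that
keys on Received lines has nothing to match against.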

--j.


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk