> >>>> One useful factor of ham is that it's not time-sensitive; a mail that
> >>>> was ham in 2003 would still be ham today. So we can collect old ham
> >>>> mail archives, or submissions of relatively old mail, if necessary.
> >>>
> >>> This may be a false assumption. A spamvertised or spam sending
> >>> domain from 2003 could have expired and been re-registered by
> >>> a different organization. Same for ham. Both ham and spam
> >>> should have expiration times. 1 year would probably be good,
> >>> since spamvertised domains probably don't get renewed.
> >>
> >> yep, I was talking with a SURBLer about this last week I think. we
> >> should probably add meta conditions to the URIBL ruleset to ensure
> >> they don't fire at all on old messages.
> >
> if we had enough ham to get useful results with that limit, sure. As
> it is, I'm not sure that's the case.

Btw, I just came across this article (from CEAS 2009):

  Jose-Marcio Martins da Cruz, Gordon V. Cormack:
  Using old Spam and Ham Samples to Train Email Filters
  http://www.j-chkmail.org/ceas/ceas09-gvcjm.pdf

Mark
