On Thursday, April 1, 2004, 11:37:54 PM, Daniel Quinlan wrote:
> Jeff Chan <[EMAIL PROTECTED]> writes:
>> Would someone with access to large spam and ham corpi please give
>> SpamCopURI a try against their recent data, as Daniel Quinlan did with
>> URIDNSBL + SURBL, and kindly let us know what kind of results they
>> obtain?  Currently four trailing days of SpamCop URI reports are
>> represented in SURBL.

> 2.6x modules, rules, and patches aren't very interesting right now.
> Give me a patch against URIDNSBL in 3.0 to add domain-to-domain testing
> and I'll gladly give it a whirl.

I would do that immediately if I knew how to write one.  I've been
rewriting my data stuff lately, while letting Eric update SpamCopURI
to now use SURBL.

(The somewhat frustrating thing is that someone already familiar
with SA 3.0 plugins could probably make such a patch for URIDNSBL
in a small fraction of the time it would take me to come up to
speed.  But I realize everyone else is short of time also.)

> Four days still seems rather low.

What would be a better expiration time, and how do you suggest
removing from the blacklist domains that are no longer active in
spams?  We can expire after any arbitrary number of days.  I'm
leaning towards seven days right now since it's a typical DNS
cacheout interval.

> Bear in mind that we're testing
> corpora that have spams somewhere between 0 and 3 months old (on
> average).  SpamCop is very hard to accurately gauge because stuff leaves
> so quickly.

True, but it also accurately reflects spams that people are
actually getting and reporting at any given moment.  To me that
feature has a significant value in timeliness.

If it's the case that domains expire out of the SpamCop URI data
sooner than the particular spam domains remain a problem, then I
could definitely see a need for a longer expiration.  Being
somewhat new to the game, I don't have any data to support either
argument.  My intuition is that if a domain continued to appear
in spam, people would continue to report it, and it would
therefore continue to show up in our SURBL data.
I'm interested in finding out what I may be overlooking in this
assumption.  Do you or anyone else here have some data that might
shed some light on this question?

> Expiring stuff quickly doesn't really reduce FPs unless
> you're testing old ham vs. new spam.  I care more about the S/O ratio
> (spam/overall where overall=ham+spam for a 50/50 mix of spam and ham).

My priorities are near zero FPs and near 100% accuracy in the
spams we do tag.  I don't guarantee that we will tag all spams,
but I'd like the ones we say are spam to actually *be* spam.
Verity is important to me.  Other techniques may be able to
catch spams which we miss, and we may be able to improve our
process to catch more spams our way.

I also think our spam% will be very high if the SpamCop reports
represent a good cross-section of actual spams at any given time.

Comments?  Surely I'm missing something...  ;)

Jeff C.
--
Jeff Chan
mailto:[EMAIL PROTECTED]
http://www.surbl.org/
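
For readers following the thread, here is a minimal sketch of the two quantities under discussion: a trailing-day expiration window for reported domains, and the S/O ratio normalized to a 50/50 spam/ham mix. It is not SURBL or SpamAssassin code. The function names, the sample domains, and the reading of "50/50 mix" as a comparison of per-corpus hit rates are all assumptions made for illustration.

```python
from datetime import date, timedelta


def active_domains(last_reported, today, max_age_days=7):
    """Return the domains whose most recent report falls inside the
    trailing window of max_age_days (hypothetical helper)."""
    cutoff = today - timedelta(days=max_age_days)
    return {domain for domain, seen in last_reported.items() if seen >= cutoff}


def so_ratio(spam_hits, spam_total, ham_hits, ham_total):
    """Spam/overall ratio for one rule, where overall = spam + ham.

    Assumption: the "50/50 mix" is modeled by comparing hit *rates*,
    so unequal corpus sizes do not skew the result. Both totals must
    be greater than zero.
    """
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return None  # the rule never fired on either corpus
    return spam_rate / (spam_rate + ham_rate)


# Made-up sample data: domain -> date of its most recent report.
reports = {
    "example-spam.test": date(2004, 3, 31),
    "old-spam.test": date(2004, 3, 20),
}
today = date(2004, 4, 2)

# A four-day window drops the older domain; a fourteen-day window keeps both.
print(sorted(active_domains(reports, today, max_age_days=4)))
print(sorted(active_domains(reports, today, max_age_days=14)))

# A rule hitting 800 of 1000 spams and 1 of 2000 hams.
print(so_ratio(spam_hits=800, spam_total=1000, ham_hits=1, ham_total=2000))
```

An S/O value near 1.0 means that nearly everything the rule hits is spam, which is the property Jeff describes as wanting. The window length trades timeliness against coverage of older corpora, which is the point Daniel raises.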
