On 29 Jun 2016, at 11:38, Shivram Krishnan wrote:
Hello Bill,
There has been enough research done in this field where the authors
have obtained data from network operators. This paper
<http://repository.upenn.edu/cgi/viewcontent.cgi?article=1962&context=cis_reports>
from UPenn, for instance, collected over 31 million mail headers (not
only IP addresses) to validate its method.
As Anthony has pointed out, they got those from their own mail system.
One mail system, coherent mail filtering practices, administrators with
powerful disincentives to falsify data, and enough detail in the data
that it would be extremely difficult to significantly falsify without
being obvious.
We are trying to get HAM/SPAM lists from different networks to
validate our technique, which curates blacklists for specific
networks.
I understand that. Unfortunately, you are trying to get data in a way
that CANNOT get you trustworthy or even coherent data. Mail filtering
statistics between different systems wouldn't just differ
quantitatively; they would be qualitatively different in scope and
meaning. I help run multiple mail systems, each of which has a unique
profile of how much and what sorts of mail reaches the point of
SpamAssassin scoring (always the final line of defense) and the
mechanisms by which attempts to send mail are stopped before the SMTP
DATA command. In some cases, substantial shunning of address space known
to be controlled by spammers happens in border routers and isn't really
countable; even if we had persistent logs of every intentional port 25
SYN packet drop, I couldn't know whether a trio of those from one IP in
a minute represents one spam attempt or three.
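To make the counting problem concrete, here is a minimal sketch of the guess you would be forced to make. The retransmission window, IP, and timestamps are invented for illustration; real senders retry on wildly varying schedules, so no window is actually correct.

```python
# Hypothetical sketch: why raw port-25 SYN-drop counts are ambiguous.
# A single blocked delivery attempt often generates several SYNs,
# because TCP retransmits on a backoff schedule. Grouping drops from
# one IP within a window is a guess, not a measurement; the window
# below is an assumption, not a standard.

from datetime import datetime, timedelta

RETRY_WINDOW = timedelta(seconds=60)  # assumed; real clients vary widely

def estimate_attempts(drops):
    """drops: list of (src_ip, datetime) for dropped port-25 SYNs,
    assumed sorted by time. Returns one possible attempt count."""
    last_seen = {}
    attempts = 0
    for ip, ts in drops:
        prev = last_seen.get(ip)
        if prev is None or ts - prev > RETRY_WINDOW:
            attempts += 1  # treat as a new delivery attempt
        last_seen[ip] = ts
    return attempts

drops = [
    ("203.0.113.7", datetime(2016, 6, 29, 11, 0, 0)),
    ("203.0.113.7", datetime(2016, 6, 29, 11, 0, 1)),  # retransmit? new try?
    ("203.0.113.7", datetime(2016, 6, 29, 11, 0, 3)),
]
print(estimate_attempts(drops))  # 1 under this window; could truly be 3
```

With a 60-second window the three drops collapse to one "attempt"; with a zero-second window they count as three. Neither number is verifiable from the packet log alone.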
Even stipulating that no one would feed you intentionally false data (a
very dubious stipulation) if you received data from a diverse set of
mail systems you would be getting datasets with divergent semantics.
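As a toy illustration of what "divergent semantics" does to aggregation (the site names, metrics, and counts below are all invented):

```python
# Hypothetical: two mail systems both report a "spam" count, but the
# numbers measure different things at different points in the pipeline.

site_a = {"metric": "messages scored as spam by SpamAssassin", "spam": 12000}
site_b = {"metric": "connections rejected before SMTP DATA", "spam": 900000}

# A naive aggregate treats the two numbers as the same quantity:
naive_total = site_a["spam"] + site_b["spam"]

# But site_b never saw message content, and site_a never saw the
# connections that border filtering already discarded, so the sum
# does not count any well-defined event.
print(naive_total)  # 912000: a number with no coherent meaning
```

Any cross-site statistic built this way inherits the same problem, even before anyone deliberately lies.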