On 29 Jun 2016, at 11:38, Shivram Krishnan wrote:

Hello Bill,

There has been enough research done in this field where the authors have obtained the data from network operators. This
<http://repository.upenn.edu/cgi/viewcontent.cgi?article=1962&context=cis_reports>
for instance is a paper from UPenn which collected over 31 million mail headers (not only IP addresses) to validate their method.

As Anthony has pointed out, they got those from their own mail system. One mail system, coherent mail filtering practices, administrators with powerful disincentives to falsify data, and enough detail in the data that it would be extremely difficult to significantly falsify without being obvious.

We are trying to get HAM/SPAM lists from different networks to validate
our technique, which curates blacklists for specific networks.

I understand that. Unfortunately, you are trying to get data in a way that CANNOT get you trustworthy or even coherent data. Mail filtering statistics from different systems wouldn't just differ quantitatively; they would be qualitatively different in scope and meaning.

I help run multiple mail systems, each of which has a unique profile of how much and what sorts of mail reaches the point of SpamAssassin scoring (always the final line of defense) and of the mechanisms by which attempts to send mail are stopped before the SMTP DATA command. In some cases, substantial shunning of address space known to be controlled by spammers happens in border routers and isn't really countable; even if we had persistent logs of every intentional port 25 SYN packet drop, I couldn't know whether a trio of those from one IP in a minute represents one spam attempt or three.
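
To make that concrete, the kind of pre-SMTP shunning I mean amounts to a border rule roughly like the one below. This is purely an illustrative sketch: the netblock is the documentation range rather than anything actually blocked, and in practice this sort of thing lives in router ACLs rather than a host firewall.

    # Silently drop every TCP SYN to port 25 from a range known to be
    # spammer-controlled, before any SMTP session or mail log entry exists.
    iptables -A FORWARD -p tcp --syn --dport 25 -s 203.0.113.0/24 -j DROP

Nothing downstream of a rule like that ever sees the connection attempt, so there is nothing coherent left to count as "spam stopped."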

Even stipulating that no one would feed you intentionally false data (a very dubious stipulation), if you received data from a diverse set of mail systems, you would be getting datasets with divergent semantics.
