On 29 Jun 2016, at 11:38, Shivram Krishnan wrote:
Hello Bill,
There has been enough research done in this field where the authors
have obtained data from network operators. This paper
<http://repository.upenn.edu/cgi/viewcontent.cgi?article=1962&context=cis_reports>
from UPenn, for instance, collected over 31 million mail headers (not
only IP addresses) to validate its method.
As Anthony has pointed out, they got those from their own mail system.
One mail system, coherent mail filtering practices, administrators with
powerful disincentives to falsify data, and enough detail in the data
that it would be extremely difficult to significantly falsify without
being obvious.
We are trying to get HAM/SPAM lists from different networks to
validate our technique, which curates blacklists for specific
networks.
I understand that. Unfortunately, you are trying to get data in a way
that CANNOT get you trustworthy or even coherent data. Mail filtering
statistics between different systems wouldn't just differ
quantitatively; they would be qualitatively different in scope and
meaning. I help run multiple mail systems, each of which has a unique
profile of how much and what sorts of mail reaches the point of
SpamAssassin scoring (always the final line of defense) and the
mechanisms by which attempts to send mail are stopped before the SMTP
DATA command. In some cases, substantial shunning of address space known
to be controlled by spammers happens in border routers and isn't really
countable; even if we had persistent logs of every intentional port 25
SYN packet drop, I couldn't know whether a trio of those from one IP in
a minute represents one spam attempt or three.
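To make the counting problem concrete, here is a minimal sketch of the guess you would be forced to make. The retransmission window, IP, and timestamps are invented for illustration; real senders retry on wildly varying schedules, so no window is actually correct.

```python
# Hypothetical sketch: why raw port-25 SYN-drop counts are ambiguous.
# A single blocked delivery attempt often generates several SYNs,
# because TCP retransmits on a backoff schedule. Grouping drops from
# one IP within a window is a guess, not a measurement; the window
# below is an assumption, not a standard.

from datetime import datetime, timedelta

RETRY_WINDOW = timedelta(seconds=60)  # assumed; real clients vary widely

def estimate_attempts(drops):
    """drops: list of (src_ip, datetime) for dropped port-25 SYNs,
    assumed sorted by time. Returns one possible attempt count."""
    last_seen = {}
    attempts = 0
    for ip, ts in drops:
        prev = last_seen.get(ip)
        if prev is None or ts - prev > RETRY_WINDOW:
            attempts += 1  # treat as a new delivery attempt
        last_seen[ip] = ts
    return attempts

drops = [
    ("203.0.113.7", datetime(2016, 6, 29, 11, 0, 0)),
    ("203.0.113.7", datetime(2016, 6, 29, 11, 0, 1)),  # retransmit? new try?
    ("203.0.113.7", datetime(2016, 6, 29, 11, 0, 3)),
]
print(estimate_attempts(drops))  # 1 under this window; could truly be 3
```

With a 60-second window the three drops collapse to one "attempt"; with a zero-second window they count as three. Neither number is verifiable from the packet log alone.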
Even stipulating that no one would feed you intentionally false data (a
very dubious stipulation) if you received data from a diverse set of
mail systems you would be getting datasets with divergent semantics.
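As a toy illustration of what "divergent semantics" does to aggregation (the site names, metrics, and counts below are all invented):

```python
# Hypothetical: two mail systems both report a "spam" count, but the
# numbers measure different things at different points in the pipeline.

site_a = {"metric": "messages scored as spam by SpamAssassin", "spam": 12000}
site_b = {"metric": "connections rejected before SMTP DATA", "spam": 900000}

# A naive aggregate treats the two numbers as the same quantity:
naive_total = site_a["spam"] + site_b["spam"]

# But site_b never saw message content, and site_a never saw the
# connections that border filtering already discarded, so the sum
# does not count any well-defined event.
print(naive_total)  # 912000: a number with no coherent meaning
```

Any cross-site statistic built this way inherits the same problem, even before anyone deliberately lies.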