On 28 Jun 2016, at 20:33, Shivram Krishnan wrote:

Hey Guys,

I am a researcher at the University of Southern California (https://steel.isi.edu/), and I have been working on making blacklists more effective by combining different blacklist sources and creating a blacklist specific to a particular network.

Though I have devised a mechanism to generate these blacklists, I have not found a suitable evaluation metric. It would be great if somebody could give me a dataset of source IP addresses of emails received by your network which have been marked as HAM/SPAM by SpamAssassin for the year 2016. I do not require the entire SPAM/HAM emails. Using this, I could evaluate the false positives/true positives of my technique.

I had posted a similar question in the previous thread, but that did not get much response.

Looking forward to your replies!

You may be disappointed...

It is extremely difficult to get a diverse corpus of non-bulk "Ham" for any form of research because most people consider such mail private, even when it is not very well protected in practice. I help run mail systems for a number of small and medium-sized businesses, which handle thousands of pieces of one-to-one mail and about as much non-spam B2B bulk mail every day. It is functionally impossible for me to try to provide a researcher any of that due to real personal privacy and business confidentiality issues and the hard cost of the simple logistics of getting permission and gathering the mail. Consumer ISPs would be even less able and willing than corporate mail operators, because most of them have highly impersonal relationships with their customers and have staff-to-user ratios that make such a project inconceivable without necessarily deceiving users: their only way would be to hide permission deep in a TOS no one reads.

The simple bottom line: Mail system operators are not able to give you the corpus you need. It's not that we wouldn't like to be able to, but we don't have it assembled because we can't ethically do so without investing huge amounts of time we do not have.

HOWEVER...

USC is in a particularly good position to be collecting such a corpus. "isi.edu" and "usc.edu" are two of the oldest domains on the Internet and, unless they are very strange among such domains, they are targeted by huge streams of spam. Less noticeable: they are probably targeted by substantial streams of non-spam "oops" mail. I'd bet that Jon Postel's old address gets targeted by hundreds of innocent messages every day, along with thousands of pieces of spam. Someone could hand-classify that. You also have the advantage of being a sizable private university, so you aren't limited by some of the rules public universities have, and you could probably do more than a public school -- for example, UCLA or UCB -- could to encourage students and staff to let you do research on their Ham.

IOW: MY users would resist or refuse me handing you their private and/or confidential mail for "research" they don't care about, I can't quickly explain, and none of us is being paid to help. OTOH: YOUR institution is well-situated to generate exactly the corpus you seek.
