Re: Corpus of Spam/Ham headers(Source IP) for research

Bill Cole Tue, 28 Jun 2016 21:05:50 -0700

On 28 Jun 2016, at 20:33, Shivram Krishnan wrote:

Hey Guys,
I am a researcher at the University of Southern California (https://steel.isi.edu/), and I have been working on making blacklists more effective by combining different blacklist sources and creating a blacklist specific to a particular network.
Though I have devised a mechanism to generate these blacklists, I am not finding a suitable evaluation metric. It would be great if somebody could give me a dataset of source IP addresses of emails received by your network which have been marked as HAM/SPAM by SpamAssassin for the year 2016. I do not require the entire SPAM/HAM emails. Using this, I could evaluate the false positives/true positives of my technique.
I had posted a similar question in the previous thread, but that did not get much response.

Looking forward to your replies!


You may be disappointed...

It is extremely difficult to get a diverse corpus of non-bulk "Ham" for any form of research because most people consider such mail private, even when it is not very well protected in practice. I help run mail systems for a number of small and medium-sized businesses, which handle thousands of pieces of one-to-one mail and about as much non-spam B2B bulk mail every day. It is functionally impossible for me to try to provide a researcher any of that due to real personal privacy and business confidentiality issues and the hard cost of the simple logistics of getting permission and gathering the mail. Consumer ISPs would be even less able and willing than corporate mail operators, because most of them have highly impersonal relationships with their customers and have staff-to-user ratios that make such a project inconceivable without necessarily deceiving users: their only way would be to hide permission deep in a TOS no one reads.

The simple bottom line: Mail system operators are not able to give you the corpus you need. It's not that we wouldn't like to be able to, but we don't have it assembled because we can't ethically do so without investing huge amounts of time we do not have.


HOWEVER...

USC is in a particularly good position to be collecting such a corpus. "isi.edu" and "usc.edu" are two of the oldest domains on the Internet, and unless they are very strange among such domains, they are targeted by huge streams of spam. Less noticeable: they are probably targeted by substantial streams of non-spam "oops" mail. I'd bet that Jon Postel's old address gets targeted by hundreds of innocent messages every day, along with thousands of pieces of spam. Someone could hand-classify that. You also have the advantage of being a sizable private university, so you aren't limited by some of the rules public universities have, and you could probably do more than a public school -- for example, UCLA or UCB -- could to encourage students and staff to let you do research on their Ham.

IOW: MY users would resist or refuse me handing you their private and/or confidential mail for "research" they don't care about, I can't quickly explain, and none of us is being paid to help. OTOH: YOUR institution is well-situated to generate exactly the corpus you seek.
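
For what it's worth, once a labeled set of source IPs exists, the evaluation described in the original request is simple to run. Below is a minimal sketch, not anyone's actual method: the function name `evaluate_blacklist`, the "spam"/"ham" label strings, and the idea that the generated blacklist is a list of CIDR ranges are all assumptions made for illustration. The addresses come from the reserved documentation ranges, not from real mail.

```python
import ipaddress


def evaluate_blacklist(blacklist_cidrs, labeled_ips):
    """Return (true_positive_rate, false_positive_rate) for a blacklist.

    blacklist_cidrs: iterable of CIDR strings (assumed blacklist format).
    labeled_ips: iterable of (ip_string, label) pairs, where label is
                 "spam" or "ham" (assumed label names).
    """
    networks = [ipaddress.ip_network(c, strict=False) for c in blacklist_cidrs]

    def listed(ip_string):
        # An address is "listed" if any blacklisted network contains it.
        addr = ipaddress.ip_address(ip_string)
        return any(addr in net for net in networks)

    tp = fp = spam_total = ham_total = 0
    for ip_string, label in labeled_ips:
        hit = listed(ip_string)
        if label == "spam":
            spam_total += 1
            tp += hit  # spam source that the blacklist catches
        else:
            ham_total += 1
            fp += hit  # ham source that the blacklist wrongly catches
    tpr = tp / spam_total if spam_total else 0.0
    fpr = fp / ham_total if ham_total else 0.0
    return tpr, fpr


# Hypothetical sample data, using documentation-only address ranges.
sample = [
    ("192.0.2.10", "spam"),
    ("192.0.2.77", "spam"),
    ("198.51.100.5", "spam"),
    ("203.0.113.9", "ham"),
    ("192.0.2.200", "ham"),
]
tpr, fpr = evaluate_blacklist(["192.0.2.0/24"], sample)
```

The rates are computed per message rather than per unique IP, so a single busy ham source that lands on the blacklist counts once for every message it sent. Whether that weighting is the right one depends on what the blacklist is meant to protect.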
