Hello Bill,

Thank you so much for your views. I agree that your customers would not like it if you shared their information. But as Oliver suggested, I need only the source IP addresses of the Spam and Ham emails, which can even be anonymized in the last octet.
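As a rough illustration of last-octet anonymization (a sketch only: the function name and the choice to zero the octet rather than hash or drop it are assumptions, not something specified in this thread):

```python
import ipaddress


def anonymize_last_octet(addr: str) -> str:
    """Zero the last octet of an IPv4 address, keeping only its /24 prefix."""
    ip = ipaddress.IPv4Address(addr)
    # Mask off the low 8 bits so every host in the same /24 maps to one value.
    return str(ipaddress.IPv4Address(int(ip) & 0xFFFFFF00))


# Example with a documentation-range address (not real mail data).
print(anonymize_last_octet("192.0.2.57"))
```

Each record would then carry only the /24 prefix plus its HAM/SPAM label.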
Will that still be a privacy concern?

On Tue, Jun 28, 2016 at 9:04 PM, Bill Cole <sausers-20150...@billmail.scconsult.com> wrote:

> On 28 Jun 2016, at 20:33, Shivram Krishnan wrote:
>
>> Hey Guys,
>>
>> I am a researcher at the University of Southern California
>> (https://steel.isi.edu/), and I have been working on making Blacklists
>> more effective by combining different sources of Blacklists, and
>> creating a Blacklists specific for a particular network.
>>
>> Though I have devised a mechanism to generate these blacklists, I am
>> not finding a suitable evaluation metric. It would be great if somebody
>> could give me a dataset of source IP addresses of emails received by
>> your network which have been marked as HAM/SPAM by Spamassassin for the
>> year 2016. I do not require the entire SPAM/HAM emails. Using this, I
>> could evaluate the false positives/ true positives of my technique.
>>
>> I had posted a similar question in the previous thread, but that did
>> not get much response.
>>
>> Looking forward to your replies!
>
> You may be disappointed...
>
> It is extremely difficult to get a diverse corpus of non-bulk "Ham" for
> any form of research because most people consider such mail private,
> even when it is not very well protected in practice. I help run mail
> systems for a number of small and medium sized businesses, which handle
> thousands of pieces of one-to-one mail and about as much non-spam B2B
> bulk mail every day. It is functionally impossible for me to try to
> provide a researcher any of that due to real personal privacy and
> business confidentiality issues and the hard cost of the simple
> logistics of getting permission and gathering the mail. Consumer ISPs
> would be even less able and willing than corporate mail operators,
> because most of them have highly impersonal relationships with their
> customers and have staff to users ratios that make such a project
> inconceivable without necessarily deceiving users: their only way would
> be to hide permission deep in a TOS no one reads.
>
> The simple bottom line: Mail system operators are not able to give you
> the corpus you need. It's not that we wouldn't like to be able to, but
> we don't have it assembled because we can't ethically do so without
> investing huge amounts of time we do not have.
>
> HOWEVER...
>
> USC is in a particularly good position to be collecting such a corpus.
> "isi.edu" and "usc.edu" are two of the oldest domains on the Internet
> and unless they are very strange among such domains, they are targeted
> by huge streams of spam. Less noticeable: they are probably targeted by
> substantial streams of non-spam "oops" mail. I'd bet that Jon Postel's
> old address gets targeted by hundreds of innocent messages every day,
> along with thousands of pieces of spam. Someone could hand-classify
> that. You also have the advantage of being a sizable private
> university, so you aren't limited by some of the rules public
> universities have, and you could probably do more than a public school
> -- for example, UCLA or UCB -- could to encourage students and staff to
> let you do research on their Ham.
>
> IOW: MY users would resist or refuse me handing you their private
> and/or confidential mail for "research" they don't care about, I can't
> quickly explain, and none of us is being paid to help. OTOH: YOUR
> institution is well-situated to generate exactly the corpus you seek.