On 28 Jun 2016, at 20:33, Shivram Krishnan wrote:
Hey Guys,
I am a researcher at the University of Southern California
( https://steel.isi.edu/ ), and I have been working on making blacklists
more effective by combining different sources of blacklists and creating
blacklists specific to a particular network.
Though I have devised a mechanism to generate these blacklists, I am not
finding a suitable evaluation metric. It would be great if somebody could
give me a dataset of source IP addresses of emails received by your
network which have been marked as HAM/SPAM by SpamAssassin for the year
2016. I do not require the entire SPAM/HAM emails. Using this, I could
evaluate the false positives/true positives of my technique.
I had posted a similar question in the previous thread, but that did not
get much response.
Looking forward to your replies!
You may be disappointed...
It is extremely difficult to get a diverse corpus of non-bulk "Ham" for
any form of research because most people consider such mail private,
even when it is not very well protected in practice. I help run mail
systems for a number of small and medium-sized businesses, which handle
thousands of pieces of one-to-one mail and about as much non-spam B2B
bulk mail every day. It is functionally impossible for me to try to
provide a researcher any of that due to real personal privacy and
business confidentiality issues and the hard cost of the simple
logistics of getting permission and gathering the mail. Consumer ISPs
would be even less able and willing than corporate mail operators,
because most of them have highly impersonal relationships with their
customers and have staff-to-user ratios that make such a project
inconceivable without necessarily deceiving users: their only way would
be to hide permission deep in a TOS no one reads.
The simple bottom line: Mail system operators are not able to give you
the corpus you need. It's not that we wouldn't like to be able to, but
we don't have it assembled because we can't ethically do so without
investing huge amounts of time we do not have.
HOWEVER...
USC is in a particularly good position to be collecting such a corpus.
"isi.edu" and "usc.edu" are two of the oldest domains on the Internet and
unless they are very strange among such domains, they are targeted by
huge streams of spam. Less noticeable: they are probably targeted by
substantial streams of non-spam "oops" mail. I'd bet that Jon Postel's
old address gets targeted by hundreds of innocent messages every day,
along with thousands of pieces of spam. Someone could hand-classify
that. You also have the advantage of being a sizable private university,
so you aren't limited by some of the rules public universities have, and
you could probably do more than a public school (UCLA or UCB, for
example) could to encourage students and staff to let you do research on
their Ham.
IOW: MY users would resist or refuse me handing you their private and/or
confidential mail for "research" they don't care about, I can't quickly
explain, and none of us is being paid to help. OTOH: YOUR institution is
well-situated to generate exactly the corpus you seek.