Re: Corpus of Spam/Ham headers(Source IP) for research

Shivram Krishnan Tue, 28 Jun 2016 22:01:18 -0700

Hello Bill,

Thank you so much for your views. I agree that your customers would not
like it if you shared their information. But, as Oliver suggested, I need
only the source IP addresses of the Spam and Ham emails, which can even be
anonymized in the last octet.
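
For illustration, last-octet anonymization and the resulting true positive /
false positive count could look roughly like the Python sketch below. The
function names, the "spam"/"ham" label strings, and the assumption that the
blacklist is matched at /24 granularity are hypothetical, chosen only for this
example; the sketch handles IPv4 addresses only.

```python
import ipaddress


def anonymize_last_octet(ip: str) -> str:
    """Zero the last octet of an IPv4 address (keep only the /24 prefix)."""
    # strict=False lets us pass a host address and get its enclosing /24.
    network = ipaddress.IPv4Network(f"{ip}/24", strict=False)
    return str(network.network_address)


def evaluate(blacklisted_ips, labeled_ips):
    """Count (true positives, false positives) at /24 granularity.

    blacklisted_ips: iterable of IPv4 strings listed by the blacklist.
    labeled_ips: iterable of (ip, label) pairs, label being "spam" or "ham"
                 (hypothetical labels for this sketch).
    """
    blocked = {anonymize_last_octet(ip) for ip in blacklisted_ips}
    true_pos = false_pos = 0
    for ip, label in labeled_ips:
        if anonymize_last_octet(ip) in blocked:
            if label == "spam":
                true_pos += 1   # spam source that the blacklist covers
            else:
                false_pos += 1  # ham source that the blacklist would block
    return true_pos, false_pos
```

With the last octet zeroed, a shared record identifies only a /24 network
rather than a single host, and matching against a blacklist is still possible
at that coarser granularity.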


Will that still be a privacy concern?



On Tue, Jun 28, 2016 at 9:04 PM, Bill Cole <
sausers-20150...@billmail.scconsult.com> wrote:

> On 28 Jun 2016, at 20:33, Shivram Krishnan wrote:
>
>> Hey Guys,
>>
>> I am a researcher at the University of Southern California (
>> https://steel.isi.edu/ ), and I have been working on making blacklists
>> more effective by combining different sources of blacklists and creating
>> a blacklist specific to a particular network.
>>
>> Though I have devised a mechanism to generate these blacklists, I have not
>> found a suitable evaluation metric. It would be great if somebody could
>> give me a dataset of source IP addresses of emails received by your
>> network which have been marked as HAM/SPAM by SpamAssassin for the year
>> 2016. I do not require the entire SPAM/HAM emails. Using this, I could
>> evaluate the false positives/true positives of my technique.
>>
>> I had posted a similar question in the previous thread, but that did not
>> get much response.
>>
>> Looking forward to your replies!
>>
>
> You may be disappointed...
>
> It is extremely difficult to get a diverse corpus of non-bulk "Ham" for
> any form of research because most people consider such mail private, even
> when it is not very well protected in practice. I help run mail systems for
> a number of small and medium sized businesses, which handle thousands of
> pieces of one-to-one mail and about as much non-spam B2B bulk mail every
> day. It is functionally impossible for me to try to provide a researcher
> any of that due to real personal privacy and business confidentiality
> issues and the hard cost of the simple logistics of getting permission and
> gathering the mail.  Consumer ISPs would be even less able and willing than
> corporate mail operators, because most of them have highly impersonal
> relationships with their customers and have staff-to-user ratios that make
> such a project inconceivable without necessarily deceiving users: their
> only way would be to hide permission deep in a TOS no one reads.
>
> The simple bottom line: Mail system operators are not able to give you the
> corpus you need. It's not that we wouldn't like to be able to, but we don't
> have it assembled because we can't ethically do so without investing huge
> amounts of time we do not have.
>
> HOWEVER...
>
> USC is in a particularly good position to be collecting such a corpus.
> "isi.edu" and "usc.edu" are two of the oldest domains on the Internet and
> unless they are very strange among such domains, they are targeted by huge
> streams of spam. Less noticeable: they are probably targeted by substantial
> streams of non-spam "oops" mail. I'd bet that Jon Postel's old address gets
> targeted by hundreds of innocent messages every day, along with thousands
> of pieces of spam. Someone could hand-classify that. You also have the
> advantage of being a sizable private university, so you aren't limited by
> some of the rules public universities have, and you could probably do more
> than a public school -- for example, UCLA or UCB -- could to encourage
> students and staff to let you do research on their Ham.
>
> IOW: MY users would resist or refuse me handing you their private and/or
> confidential mail for "research" they don't care about, I can't quickly
> explain, and none of us is being paid to help. OTOH: YOUR institution is
> well-situated to generate exactly the corpus you seek.
>
