> I read here http://lemurproject.org/clueweb09/ that there is a hosted
> version of ClueWeb09 (the latest is ClueWeb12, for which I can't find a
> hosted version). To get access to it, someone from the ASF will need
> to sign an Organizational Agreement with them, and each individual
> in the project will need to sign an Individual Agreement (retained by the
> ASF). Perhaps this can be made available only to committers.

This is nice! I'll try to ask the ASF about this.

> To this day, I think the only way it will happen is for the "community"
> to build a completely open system, perhaps based off of Common Crawl or
> our own crawl and host it ourselves and develop judgments, etc.

Yeah, this is what we need in ORP.

> Most people like the idea, but are not sure how to distribute it in an
> open way (ClueWeb comes as 4 1TB disks right now) and I am also not sure
> how they would handle any copyright/redaction claims against it.  There
> is, of course, little incentive for those involved to solve these, either,
> as most people who are interested sign the form and pay the $600 for the
> disks.

Sigh, yes, it is hard to make a data set totally public. Actually, one of
my purposes in raising this question is to see whether it is acceptable in
our community (i.e. Lucene/Solr only) to obtain a data set that is not open
to all people. Once we expand to a larger scope, the licensing issue gets
somewhat hairy...


And since Shai has found a possible 'free' data set, I think it is possible
for the ASF to obtain an Organizational Agreement for this. I'll try to
contact the ASF & CMU about how they define a "person with the authority"
in an OSS context.
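
By the way, since the pluggable scoring models in 4.0 come up below: here
is a minimal sketch of how one could swap in a different Similarity for
relevance experiments. This assumes the Lucene 4.x API (the
org.apache.lucene.search.similarities package); BM25Similarity is just one
example, not a recommendation of any particular model.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.search.similarities.Similarity;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class PluggableScoringSketch {
      public static void main(String[] args) throws Exception {
        // Pick a scoring model; BM25 here, but e.g. LMDirichletSimilarity
        // plugs in the same way.
        Similarity similarity = new BM25Similarity();

        Directory dir = new RAMDirectory();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
            new StandardAnalyzer(Version.LUCENE_40));
        iwc.setSimilarity(similarity); // affects index-time norms
        IndexWriter writer = new IndexWriter(dir, iwc);
        // ... add documents here ...
        writer.close();

        // The searcher should use the same Similarity for consistent scoring.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        searcher.setSimilarity(similarity); // affects query-time scoring
        // ... run queries and compare rankings across Similarity impls ...
      }
    }

With a standard test collection and relevance judgments, rerunning the same
queries under different Similarity implementations is exactly the kind of
comparison this thread is about.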


On Tue, Sep 17, 2013 at 6:11 AM, Grant Ingersoll <gsing...@apache.org> wrote:

> Inline below
>
> On Sep 9, 2013, at 10:53 PM, Han Jiang <jiangha...@gmail.com> wrote:
>
> Back in 2007, Grant contacted NIST about making the TREC collections
> available to our community:
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/200708.mbox/browser
>
> I think another attempt at this is really important to our project and to
> people who use Lucene. All these years, speed performance has mainly been
> tuned on Wikipedia; however, Wikipedia is not a very 'standard' benchmark:
>
> * it doesn't represent how real-world search works;
> * it cannot be used to evaluate the relevance of our scoring models;
> * researchers tend to run experiments on other data sets, so it is usually
>   hard to know whether Lucene is performing at its best.
>
> And personally I agree with this line:
>
> > I think it would encourage Lucene users/developers to think about
> > relevance as much as we think about speed.
>
> There's been much work to make Lucene's scoring models pluggable in 4.0,
> and it'll be great if we can explore this further. It is very appealing to
> see a high-performance library working alongside state-of-the-art ranking
> methods.
>
>
> As for the TREC data sets, the problems we've met are:
>
> 1. NIST/TREC does not own the original collections, so it might be
>    necessary to make direct contact with the organizations that actually
>    own them, such as:
>
>    http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
>    http://lemurproject.org/clueweb12/
>
> 2. Currently, there is no open-source license for any of the data sets, so
>    they won't be as 'open' as Wikipedia is.
>
>    As Grant proposed, one possibility is to make the data set accessible
>    only to committers instead of all users. That is not very open source,
>    but the TREC data sets are public and usually available to researchers,
>    so people could still reproduce performance tests.
>
> I'm quite curious: has anyone explored getting an open-source license for
> one of those data sets? And is our community still interested in this
> issue after all these years?
>
>
> It continues to be of interest to me.  I've had various conversations
> throughout the years on it.  Most people like the idea, but are not sure
> how to distribute it in an open way (ClueWeb comes as 4 1TB disks right
> now) and I am also not sure how they would handle any copyright/redaction
> claims against it.  There is, of course, little incentive for those
> involved to solve these, either, as most people who are interested sign the
> form and pay the $600 for the disks.  I've had a number of conversations
> about how I view this to be a significant barrier to open research, esp. in
> under-served countries and to open source.  People sympathize with me, but
> then move on.
>
> To this day, I think the only way it will happen is for the "community" to
> build a completely open system, perhaps based off of Common Crawl or our
> own crawl and host it ourselves and develop judgments, etc.  We tried to
> get this off the ground w/ the Open Relevance Project, but there was never
> a sustainable effort, and thus I have little hope at this point for it (but
> I would love to be proven wrong). For it to succeed, I think we would need
> the backing of a University with students interested in curating such a
> collection, the judgments, etc.  I think we could figure out how to
> distribute the data either as an AWS public data set or possibly via the
> ASF or similar (although I am pretty sure the ASF would balk at multi-TB
> sized downloads).
>
> Happy to hear other ideas.
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>


-- 
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China
