> I read here http://lemurproject.org/clueweb09/ that there is a hosted version of ClueWeb09 (the latest is ClueWeb12, for which I don't find a hosted version). To get access to it, someone from the ASF will need to sign an Organizational Agreement with them, and each individual in the project will need to sign an Individual Agreement (retained by the ASF). Perhaps this can be available only to committers.
This is nice! I'll try to ask the ASF about this.

> To this day, I think the only way it will happen is for the "community" to build a completely open system, perhaps based off of Common Crawl or our own crawl, and host it ourselves and develop judgments, etc.

Yeah, this is what we need in ORP.

> Most people like the idea, but are not sure how to distribute it in an open way (ClueWeb comes as 4 1TB disks right now), and I am also not sure how they would handle any copyright/redaction claims against it. There is, of course, little incentive for those involved to solve these, either, as most people who are interested sign the form and pay the $600 for the disks.

Sigh, yes, it is hard to make a data set totally public. Actually, one of my purposes in asking this question is to see whether it is acceptable in our community (i.e. Lucene/Solr only) to obtain a data set that is not open to all people. Once we expand to a larger scope, the license issue gets somewhat hairy... And since Shai has found a possible 'free' data set, I think it should be feasible for the ASF to sign an Organizational Agreement for it. I'll try to contact the ASF & CMU about how they define "person with the authority" in OSS.

On Tue, Sep 17, 2013 at 6:11 AM, Grant Ingersoll <gsing...@apache.org> wrote:

> Inline below
>
> On Sep 9, 2013, at 10:53 PM, Han Jiang <jiangha...@gmail.com> wrote:
>
> Back in 2007 Grant contacted NIST about making the TREC collections available to our community:
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/200708.mbox/browser
>
> I think another attempt at this is really important to our project and to people who use Lucene. All these years, speed performance has mainly been tuned on Wikipedia, but it's not a very 'standard' benchmark:
>
> * it doesn't represent how real-world search works;
> * it cannot be used to evaluate the relevance of our scoring models;
> * researchers tend to run experiments on other data sets, and usually it is hard to know whether Lucene is performing at its best;
>
> And personally I agree with this line:
>
> > I think it would encourage Lucene users/developers to think about relevance as much as we think about speed.
>
> There's been much work to make Lucene's scoring models pluggable in 4.0, and it'll be great if we can explore it more. It is very appealing to see a high-performance library work along with state-of-the-art ranking methods.
>
> And about the TREC data sets, the problems we've met are:
>
> 1. NIST/TREC does not own the original collections, therefore it might be necessary to have direct contact with the organizations that actually do, such as:
>
>    http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
>    http://lemurproject.org/clueweb12/
>
> 2. Currently, there is no open-source license for any of the data sets, so they won't be as 'open' as Wikipedia is.
>
> As proposed by Grant, one possibility is to make the data set accessible only to committers instead of all users. That is not very open-source, but TREC data sets are public and usually available to researchers, so people can still reproduce performance tests.
>
> I'm quite curious: has anyone explored getting an open-source license for one of those data sets? And is our community still interested in this issue after all these years?
>
>
> It continues to be of interest to me. I've had various conversations throughout the years on it.
> Most people like the idea, but are not sure how to distribute it in an open way (ClueWeb comes as 4 1TB disks right now), and I am also not sure how they would handle any copyright/redaction claims against it. There is, of course, little incentive for those involved to solve these, either, as most people who are interested sign the form and pay the $600 for the disks. I've had a number of conversations about how I view this to be a significant barrier to open research, esp. in under-served countries, and to open source. People sympathize with me, but then move on.
>
> To this day, I think the only way it will happen is for the "community" to build a completely open system, perhaps based off of Common Crawl or our own crawl, and host it ourselves and develop judgments, etc. We tried to get this off the ground w/ the Open Relevance Project, but there was never a sustainable effort, and thus I have little hope at this point for it (but I would love to be proven wrong). For it to succeed, I think we would need the backing of a university with students interested in curating such a collection, the judgments, etc. I think we could figure out how to distribute the data either as an AWS public data set or possibly via the ASF or similar (although I am pretty sure the ASF would balk at multi-TB sized downloads).
>
> Happy to hear other ideas.
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com

--
Han Jiang
Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China
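As an aside on the "pluggable scoring models in 4.0" point quoted above, here is a minimal sketch (not from the original thread) of how a scoring model can be swapped in Lucene 4.x using the stock BM25Similarity; the field name, k1/b values, and sample document are made up for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PluggableScoringSketch {
    public static void main(String[] args) throws Exception {
        // Pick a scoring model; use the same Similarity at index and search time.
        BM25Similarity bm25 = new BM25Similarity(1.2f, 0.75f); // k1, b (illustrative values)

        Directory dir = new RAMDirectory();
        IndexWriterConfig iwc =
            new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        iwc.setSimilarity(bm25);

        // Index one toy document.
        IndexWriter writer = new IndexWriter(dir, iwc);
        Document doc = new Document();
        doc.add(new TextField("body", "hello lucene scoring", Field.Store.NO));
        writer.addDocument(doc);
        writer.close();

        // Search with the same Similarity plugged into the searcher.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        searcher.setSimilarity(bm25);
        TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        System.out.println("hits: " + hits.totalHits);
    }
}

Other Similarity implementations shipped with 4.x (e.g. DFRSimilarity, LMDirichletSimilarity) or a custom one can be substituted the same way, which is part of why a standard relevance collection would be so useful for comparing them.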