Available datasets for recommendations

2011-07-07 Thread Lance Norskog
What recommendation datasets, that are available, are considered "large" by Mahout testing standards? Yahoo KDD Cup is offline, the Netflix data went under a cloud... -- Lance Norskog goks...@gmail.com

Re: Available datasets for recommendations

2011-07-07 Thread Ted Dunning
Those are both reasonably large, but not commercial in scale. At Veoh, we had about 10 non-zero elements in our raw data. I think Netflix has 100 million. On Thu, Jul 7, 2011 at 8:05 PM, Lance Norskog wrote: > What recommendation datasets, that are available, are considered > "large" by Mahout

Re: Available datasets for recommendations

2011-07-07 Thread Alex Kozlov
There is still a libimseti dataset http://www.occamslab.com/petricek/data/ with 17,359,346 ratings. People are scared after the Netflix lawsuit. On Thu, Jul 7, 2011 at 10:17 PM, Ted Dunning wrote: > Those are both reasonably large, but not commercial in scale. > > At Veoh, we had about 10 non-zer

Re: Available datasets for recommendations

2011-07-07 Thread web service
Is it taken offline as well? On Thu, Jul 7, 2011 at 10:40 PM, Alex Kozlov wrote: > There is still a libimseti dataset > http://www.occamslab.com/petricek/data/ with 17,359,346 ratings. People > are scared after the Netflix lawsuit. > > On Thu, Jul 7, 2011 at 10:17 PM, Ted Dunning > wrote: > > >

Re: Available datasets for recommendations

2011-07-08 Thread Sean Owen
The link is http://www.occamslab.com/petricek/data/ The KDD or Netflix data are plenty big to play with. How big is big for your purpose? On Fri, Jul 8, 2011 at 7:05 AM, web service wrote: > Is it taken offline as well ? > > On Thu, Jul 7, 2011 at 10:40 PM, Alex Kozlov wrote: > > > There is st

Re: Available datasets for recommendations

2011-07-08 Thread Sebastian Schelter
Another dataset to play with is this compilation of song listenings scraped from the last.fm API: http://mtg.upf.edu/node/1671. Should include about 20M ratings. --sebastian On 08.07.2011 09:17, Sean Owen wrote: The link is http://www.occamslab.com/petricek/data/ The KDD or Netflix data are

Re: Available datasets for recommendations

2011-07-08 Thread Lance Norskog
Thanks. Netflix & Yahoo KDD were my first choices, but are gone. It did not occur to me that stashing such things away would be wise, packrat though I am. Purpose is testing large user/item or document/term databases. On Fri, Jul 8, 2011 at 12:44 AM, Sebastian Schelter wrote: > Another dataset

Re: Available datasets for recommendations

2011-07-08 Thread Steven Bourke
MovieLens would be the one that's most commonly used by researchers; they have 100k, 1 million and 10 million ratings datasets. On Fri, Jul 8, 2011 at 10:26 AM, Lance Norskog wrote: > Thanks. > > Netflix & Yahoo KDD were my first choice, but are gone. It did not > occur to me that stashing such
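
For anyone trying the MovieLens files mentioned above, a minimal sketch of a format conversion follows. It assumes the `UserID::MovieID::Rating::Timestamp` line layout used by the 1 million and 10 million ratings files, and it assumes the target is a plain comma-separated `user,item,rating` file of the kind file-based recommender data models commonly read. The function name `movielens_to_csv` is hypothetical, not part of any library.

```python
def movielens_to_csv(lines):
    """Convert 'user::item::rating::timestamp' lines to 'user,item,rating'.

    Assumption: input follows the '::'-delimited layout of the MovieLens
    1M/10M ratings files. Blank lines are skipped; the timestamp is dropped.
    """
    converted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item, rating, _timestamp = line.split("::")
        converted.append(f"{user},{item},{rating}")
    return converted


if __name__ == "__main__":
    sample = ["1::1193::5::978300760", "1::661::3::978302109"]
    for row in movielens_to_csv(sample):
        print(row)
```

The ratings are kept as strings, so half-star values such as `3.5` pass through unchanged.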

Re: Available datasets for recommendations

2011-07-08 Thread Grant Ingersoll
It's not a traditional ratings corpus, but the ASF mail archives I put up all have clear provenance and are freely available and I don't think it is too hard to make a recommender problem out of them, likely based on the replies. There are 6m+ items in it. And now that Amazon has free inbound,
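
One way the reply-based idea above could be turned into recommender input is sketched below. This is only an illustration under stated assumptions: each message is a dict with hypothetical `id`, `sender`, and `in_reply_to` fields (the layout of the actual ASF archive is not described in this thread), the "user" is the person replying, the "item" is the author being replied to, and the reply count stands in for an implicit preference strength.

```python
from collections import Counter


def reply_interactions(messages):
    """Count (replier, original_author) pairs from a list of messages.

    Assumption: every message is a dict with hypothetical keys 'id',
    'sender', and 'in_reply_to' (None for thread starters). Replies whose
    parent is not in the list, and self-replies, are ignored.
    """
    by_id = {m["id"]: m for m in messages}
    counts = Counter()
    for m in messages:
        parent = by_id.get(m.get("in_reply_to"))
        if parent is None:
            continue  # thread starter, or parent missing from the corpus
        if parent["sender"] == m["sender"]:
            continue  # a self-reply carries no preference signal
        counts[(m["sender"], parent["sender"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"id": "m1", "sender": "lance", "in_reply_to": None},
        {"id": "m2", "sender": "ted", "in_reply_to": "m1"},
        {"id": "m3", "sender": "alex", "in_reply_to": "m2"},
    ]
    for (user, item), n in reply_interactions(sample).items():
        print(f"{user},{item},{n}")
```

The resulting triples have the same user/item/value shape as a ratings file, so they could be fed to the same kind of tooling; whether reply counts make a good preference signal is an open question, not something this sketch settles.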

Re: Available datasets for recommendations

2011-07-08 Thread Lance Norskog
Ratings and more generally "parallel universe" or "dual space" or "dyadic" (but that is other things): Correspondences between samples in two different parallel spaces. A mail corpus has different kinds of gleanable knowledge: word/subject line correspondences, authoritative mail vs. conversation

Re: Available datasets for recommendations

2011-07-08 Thread Dan Brickley
On 9 July 2011 00:03, Lance Norskog wrote: > Ratings and more generally "parallel universe" or "dual space" or > "dyadic" (but that is other things): Correspondences between samples > in two different parallel spaces. > > A mail corpus has different kinds of gleanable knowledge: word/subject > lin

Re: Available datasets for recommendations

2011-07-08 Thread Lance Norskog
Thanks. That final one is now in torrent-land. The RDF SKOS stuff reminds me: RDF seems pointless without a confidence factor per triple. It would be cool to take a large self-inconsistent RDF-triple graph and grind out a set of confidence factors that makes it consistent. "This stuff is good, thi

Re: Available datasets for recommendations

2011-07-09 Thread Andrew Clegg
There's a Mendeley data set available on request: http://dev.mendeley.com/datachallenge/ 5M documents (scientific papers etc.) x 50K readers -- not huge, but an interesting problem space. On 8 July 2011 23:23, Dan Brickley wrote: > On 9 July 2011 00:03, Lance Norskog wrote: >> Ratings and more