Thanks. Netflix & Yahoo KDD were my first choices, but both are gone. It did not occur to me that stashing such things away would be wise, packrat though I am.
The purpose is testing large user/item or document/term databases.

On Fri, Jul 8, 2011 at 12:44 AM, Sebastian Schelter <s...@apache.org> wrote:
> Another dataset to play with is this compilation of song listenings scraped
> from the last.fm API:
>
> http://mtg.upf.edu/node/1671.
>
> Should include about 20M ratings.
>
> --sebastian
>
> On 08.07.2011 09:17, Sean Owen wrote:
>>
>> The link is http://www.occamslab.com/petricek/data/
>>
>> The KDD or Netflix data are plenty big to play with. How big is big for
>> your purpose?
>>
>> On Fri, Jul 8, 2011 at 7:05 AM, web service<wbs...@gmail.com> wrote:
>>
>>> Is it taken offline as well ?
>>>
>>> On Thu, Jul 7, 2011 at 10:40 PM, Alex Kozlov<ale...@cloudera.com> wrote:
>>>
>>>> There is still a libimseti dataset
>>>> http://www.occamslab.com/petricek/data with 17,359,346 ratings. People
>>>> are scared after the Netflix lawsuit.
>>>>
>>>> On Thu, Jul 7, 2011 at 10:17 PM, Ted Dunning<ted.dunn...@gmail.com>
>>>> wrote:
>>>>
>>>>> Those are both reasonably large, but not commercial in scale.
>>>>>
>>>>> At Veoh, we had about 10 non-zero elements in our raw data. I think
>>>>> Netflix has 100 million.
>>>>>
>>>>> On Thu, Jul 7, 2011 at 8:05 PM, Lance Norskog<goks...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What recommendation datasets, that are available, are considered
>>>>>> "large" by Mahout testing standards? Yahoo KDD Cup is offline, the
>>>>>> Netflix data went under a cloud...
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goks...@gmail.com

--
Lance Norskog
goks...@gmail.com