There's a Mendeley data set available on request: http://dev.mendeley.com/datachallenge/
5M documents (scientific papers etc.) x 50K readers -- not huge, but an interesting problem space.

On 8 July 2011 23:23, Dan Brickley <dan...@danbri.org> wrote:
> On 9 July 2011 00:03, Lance Norskog <goks...@gmail.com> wrote:
>> Ratings and more generally "parallel universe" or "dual space" or
>> "dyadic" (but that is other things): Correspondences between samples
>> in two different parallel spaces.
>>
>> A mail corpus has different kinds of gleanable knowledge: word/subject
>> line correspondences, authoritative mail v.s. conversational, reply-to
>> is a one-way relationship in the same space, time series aspects, and
>> more. It would be a good base for an examples/ set of several
>> algorithms and interpreting all concepts. That's a trimester course.
>
> On the mail front, just to mention I finally found a tool capable of
> downloading my over-sized Gmail mail archives:
>
> http://toroid.org/ams/etc/gmail-imap-mirror
>
> It also preserves tags. And since I've got Gmail filters tagging
> almost every incoming mail, at least those from lists, then this
> creates over time a nice repository of associations from posters to
> mailing lists to folder label tags. I've been thinking to throw this
> soup into Mahout but haven't really thought through exactly what to
> try first. I thought it would be nice to cluster tags and people, for
> example.
>
> Another family of dataset I mentioned recently: there are lots of
> "Linked Data" collections opening up, from libraries, museums and
> other public sector bodies. See cloud visualisation at
> http://richard.cyganiak.de/2007/10/lod/ or the directory of datasets
> at http://ckan.net/ from which this is generated.
>
> The library/museum and cultural heritage collections in this scene
> often use SKOS, which is an RDF vocabulary for representing topics
> (thesaurus-like stuff). So you get some interesting structure there,
> and often a database that is a set of records which are tagged with
> one or more SKOS URI. So again, not classic recommendation dataset but
> a lot still worth digging into.
>
> Bibliographic data: http://ckan.net/group/bibliographic
>
> Nearby: draft report from W3C Linked Library incubator group,
> http://lists.w3.org/Archives/Public/public-lld/2011Jun/0084.html ...
> including lengthy 'vocabularies and datasets' report,
> http://www.w3.org/2005/Incubator/lld/wiki/Vocabulary_and_Dataset
>
> More on SKOS, see http://www.w3.org/2004/02/skos/
>
> British Library's "national bibliography" at
> http://www.bl.uk/bibliographic/datasamples.html
> http://ckan.net/package/jiscopenbib-bl_bnb-1
> http://openbiblio.net/2010/11/22/querying-the-british-national-bibliography/
>
> ...this gives you about 3 million records. Some of which data surfaces
> in http://bibliographica.org/ ... and nearby there are things like
> http://openlibrary.org/ not to mention DBpedia.org, Freebase.com and
> similar.
>
> So while there might be a shortage of classic "users x items"
> entertainment-content recommender datasets, there are many many other
> interesting collections being released as open data, on weekly basis.
>
> cheers,
>
> Dan
>
>
> ps. many Twitter crawls have vanished, but take a close look at the
> html for http://snap.stanford.edu/data/twitter7.html ...

--
http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg