On 9 July 2011 00:03, Lance Norskog <goks...@gmail.com> wrote:
> Ratings and more generally "parallel universe" or "dual space" or
> "dyadic" (but that is other things): Correspondences between samples
> in two different parallel spaces.
>
> A mail corpus has different kinds of gleanable knowledge: word/subject
> line correspondences, authoritative mail v.s. conversational, reply-to
> is a one-way relationship in the same space, time series aspects, and
> more. It would be a good base for an examples/ set of several
> algorithms and interpreting all concepts. That's a trimester course.

On the mail front, just to mention I finally found a tool capable of
downloading my over-sized Gmail mail archives:

http://toroid.org/ams/etc/gmail-imap-mirror

It also preserves tags. And since I've got Gmail filters tagging almost
every incoming mail, at least those from lists, this creates over time
a nice repository of associations from posters to mailing lists to
folder label tags. I've been thinking of throwing this soup into Mahout
but haven't really thought through exactly what to try first. I thought
it would be nice to cluster tags and people, for example.

Another family of datasets I mentioned recently: there are lots of
"Linked Data" collections opening up, from libraries, museums and other
public sector bodies. See the cloud visualisation at
http://richard.cyganiak.de/2007/10/lod/ or the directory of datasets at
http://ckan.net/ from which it is generated.

The library/museum and cultural heritage collections in this scene
often use SKOS, which is an RDF vocabulary for representing topics
(thesaurus-like stuff). So you get some interesting structure there,
and often a database that is a set of records, each tagged with one or
more SKOS URIs. So again, not a classic recommendation dataset, but a
lot still worth digging into.

Bibliographic data: http://ckan.net/group/bibliographic

Nearby: draft report from W3C Linked Library incubator group,
http://lists.w3.org/Archives/Public/public-lld/2011Jun/0084.html
... including lengthy 'vocabularies and datasets' report,
http://www.w3.org/2005/Incubator/lld/wiki/Vocabulary_and_Dataset

More on SKOS, see http://www.w3.org/2004/02/skos/

British Library's "national bibliography" at
http://www.bl.uk/bibliographic/datasamples.html
http://ckan.net/package/jiscopenbib-bl_bnb-1
http://openbiblio.net/2010/11/22/querying-the-british-national-bibliography/
...this gives you about 3 million records. Some of which data surfaces
in http://bibliographica.org/ ...
and nearby there are things like http://openlibrary.org/ not to mention
DBpedia.org, Freebase.com and similar.

So while there might be a shortage of classic "users x items"
entertainment-content recommender datasets, there are many, many other
interesting collections being released as open data, on a weekly basis.

cheers,

Dan

ps. many Twitter crawls have vanished, but take a close look at the
html for http://snap.stanford.edu/data/twitter7.html ...
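
A rough sketch of the "cluster tags and people" idea from the mail
archive paragraph above. It is plain Python, not Mahout, and rests on
assumptions that are not in the original message: that each mirrored
mail is available as raw RFC 822 text, and that the labels sit in a
comma-separated header (the name X-Gmail-Labels is a placeholder here;
the mirror tool may store tags differently). It builds a poster-by-label
count table and compares posters by cosine similarity, which is one
possible input to a clustering step.

```python
from collections import Counter, defaultdict
from email import message_from_string
from email.utils import parseaddr
from math import sqrt


def poster_label_counts(raw_messages, label_header="X-Gmail-Labels"):
    """Count how often each poster's mail carries each label.

    label_header is an assumed header name, not something the
    mirror tool is known to emit.
    """
    counts = defaultdict(Counter)
    for raw in raw_messages:
        msg = message_from_string(raw)
        # parseaddr splits "Name <addr>" into (name, addr).
        _, poster = parseaddr(msg.get("From", ""))
        if not poster:
            continue
        for label in msg.get(label_header, "").split(","):
            label = label.strip()
            if label:
                counts[poster.lower()][label] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two label-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up sample messages, for illustration only.
    sample = [
        "From: Alice <alice@example.org>\nX-Gmail-Labels: mahout,lists\n\nhi\n",
        "From: bob@example.org\nX-Gmail-Labels: mahout\n\nhi\n",
    ]
    table = poster_label_counts(sample)
    print(cosine(table["alice@example.org"], table["bob@example.org"]))
```

The same table, transposed, gives label-by-poster vectors, so labels
can be compared with the same cosine function.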