Re: Available datasets for recommendations

Dan Brickley Fri, 08 Jul 2011 15:53:43 -0700

On 9 July 2011 00:03, Lance Norskog <[email protected]> wrote:
> Ratings and more generally "parallel universe" or "dual space" or
> "dyadic" (but that is other things): Correspondences between samples
> in two different parallel spaces.
>
> A mail corpus has different kinds of gleanable knowledge: word/subject
> line correspondences, authoritative mail v.s. conversational, reply-to
> is a one-way relationship in the same space, time series aspects, and
> more. It would be a good base for an examples/ set of several
> algorithms and interpreting all concepts. That's a trimester course.


On the mail front, just to mention I finally found a tool capable of
downloading my over-sized Gmail mail archives:

http://toroid.org/ams/etc/gmail-imap-mirror

It also preserves tags. Since I have Gmail filters tagging almost
every incoming mail, at least those from lists, this builds up over
time a nice repository of associations from posters to mailing lists
to folder label tags. I've been thinking of throwing this soup into
Mahout but haven't really thought through exactly what to try first.
I thought it would be nice to cluster tags and people, for example.
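
As a rough idea of a first step before anything Mahout-specific, here
is a minimal Python sketch that scores how similar two tags are by the
posters they share (Jaccard overlap). The (poster, tag) pairs and the
tag_similarity helper are made up for illustration; real pairs would
have to be extracted from the mirrored mail, and I haven't worked out
how.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (poster, tag) pairs; real ones would come from the mirrored mail.
observations = [
    ("alice@example.org", "mahout-user"),
    ("alice@example.org", "lucene-dev"),
    ("bob@example.org", "mahout-user"),
    ("bob@example.org", "lucene-dev"),
    ("carol@example.org", "public-lld"),
]


def tag_similarity(pairs):
    """Jaccard similarity between tags, based on the posters they share."""
    posters_by_tag = defaultdict(set)
    for poster, tag in pairs:
        posters_by_tag[tag].add(poster)

    scores = {}
    # Compare every unordered pair of tags, in sorted order for stable keys.
    for a, b in combinations(sorted(posters_by_tag), 2):
        shared = posters_by_tag[a] & posters_by_tag[b]
        union = posters_by_tag[a] | posters_by_tag[b]
        scores[(a, b)] = len(shared) / len(union)
    return scores


similarity = tag_similarity(observations)
```

Tags with high scores would be candidates for the same cluster;
swapping the roles of poster and tag gives the same measure for people.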

Another family of datasets I mentioned recently: there are lots of
"Linked Data" collections opening up, from libraries, museums and
other public sector bodies. See the cloud visualisation at
http://richard.cyganiak.de/2007/10/lod/ or the directory of datasets
at http://ckan.net/ from which it is generated.

The library/museum and cultural heritage collections in this scene
often use SKOS, which is an RDF vocabulary for representing topics
(thesaurus-like stuff). So you get some interesting structure there,
and often a database that is a set of records, each tagged with
one or more SKOS URIs. So again, not a classic recommendation dataset,
but there is a lot still worth digging into.
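
To illustrate the kind of structure I mean, here is a small Python
sketch that expands each record's tags with their ancestors via
skos:broader links, so that records tagged with sibling topics end up
sharing concepts. The URIs, the records and the with_ancestors helper
are all invented for the example, and it assumes each concept has at
most one broader concept, which real SKOS data does not guarantee.

```python
# Hypothetical skos:broader links (child concept -> parent concept).
broader = {
    "http://example.org/concept/jazz": "http://example.org/concept/music",
    "http://example.org/concept/opera": "http://example.org/concept/music",
    "http://example.org/concept/music": "http://example.org/concept/arts",
}

# Hypothetical records, each tagged with one or more concept URIs.
records = {
    "record-1": {"http://example.org/concept/jazz"},
    "record-2": {"http://example.org/concept/opera"},
}


def with_ancestors(concepts, broader):
    """Return the concepts plus everything reachable via broader links."""
    expanded = set()
    for concept in concepts:
        # Walk up the hierarchy; stop at the top or at an already-seen concept.
        while concept is not None and concept not in expanded:
            expanded.add(concept)
            concept = broader.get(concept)
    return expanded


expanded = {rid: with_ancestors(tags, broader) for rid, tags in records.items()}
shared = expanded["record-1"] & expanded["record-2"]
```

The two records have no tags in common as given, but after expansion
they share the "music" and "arts" concepts.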

Bibliographic data: http://ckan.net/group/bibliographic

Nearby: draft report from the W3C Library Linked Data incubator group,
http://lists.w3.org/Archives/Public/public-lld/2011Jun/0084.html ...
including a lengthy 'vocabularies and datasets' report,
http://www.w3.org/2005/Incubator/lld/wiki/Vocabulary_and_Dataset

For more on SKOS, see http://www.w3.org/2004/02/skos/

British Library's "national bibliography" at
http://www.bl.uk/bibliographic/datasamples.html
http://ckan.net/package/jiscopenbib-bl_bnb-1
http://openbiblio.net/2010/11/22/querying-the-british-national-bibliography/

...this gives you about 3 million records. Some of that data surfaces
in http://bibliographica.org/ ... and nearby there are things like
http://openlibrary.org/ not to mention DBpedia.org, Freebase.com and
similar.

So while there might be a shortage of classic "users x items"
entertainment-content recommender datasets, there are many, many other
interesting collections being released as open data, on a weekly basis.

cheers,

Dan


ps. many Twitter crawls have vanished, but take a close look at the
html for http://snap.stanford.edu/data/twitter7.html  ...
