There's a Mendeley data set available on request:

http://dev.mendeley.com/datachallenge/

5M documents (scientific papers etc.) x 50K readers -- not huge, but
an interesting problem space.
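
For a sense of what a documents x readers matrix like this lends itself to, here is a minimal sketch of item-item similarity from co-readership. Everything in it is an assumption for illustration: the (reader, document) pair layout, the IDs, and the function names are made up, and the real data challenge files may look quite different.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (reader_id, document_id) pairs. The real data set's
# layout is not known here; these rows are invented for illustration.
readings = [
    ("r1", "d1"), ("r1", "d2"),
    ("r2", "d1"), ("r2", "d2"), ("r2", "d3"),
    ("r3", "d3"),
]


def readers_by_document(pairs):
    """Build a sparse index: document id -> set of reader ids."""
    index = defaultdict(set)
    for reader, doc in pairs:
        index[doc].add(reader)
    return dict(index)


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def document_similarities(pairs):
    """Jaccard similarity for every pair of documents, by shared readers."""
    index = readers_by_document(pairs)
    return {
        (d1, d2): jaccard(index[d1], index[d2])
        for d1, d2 in combinations(sorted(index), 2)
    }


sims = document_similarities(readings)
for pair, score in sorted(sims.items()):
    print(pair, round(score, 3))
```

The all-pairs loop is quadratic in the number of documents, so it is only good for toy input; at 5M documents a distributed or pruned similarity job would be the realistic route.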

On 8 July 2011 23:23, Dan Brickley <dan...@danbri.org> wrote:
> On 9 July 2011 00:03, Lance Norskog <goks...@gmail.com> wrote:
>> Ratings, and more generally "parallel universe", "dual space" or
>> "dyadic" data (though that term also covers other things):
>> correspondences between samples in two different parallel spaces.
>>
>> A mail corpus has different kinds of gleanable knowledge: word/subject
>> line correspondences, authoritative mail vs. conversational, reply-to
>> is a one-way relationship in the same space, time series aspects, and
>> more. It would be a good base for an examples/ set of several
>> algorithms and interpreting all concepts. That's a trimester course.
>
> On the mail front, just to mention I finally found a tool capable of
> downloading my over-sized Gmail mail archives:
>
> http://toroid.org/ams/etc/gmail-imap-mirror
>
> It also preserves tags. And since I've got Gmail filters tagging
> almost every incoming mail, at least those from lists, then this
> creates over time a nice repository of associations from posters to
> mailing lists to folder label tags. I've been thinking of throwing this
> soup into Mahout but haven't really thought through exactly what to
> try first. I thought it would be nice to cluster tags and people, for
> example.
>
> Another family of datasets I mentioned recently: there are lots of
> "Linked Data" collections opening up, from libraries, museums and
> other public sector bodies. See cloud visualisation at
> http://richard.cyganiak.de/2007/10/lod/ or the directory of datasets
> at http://ckan.net/ from which this is generated.
>
> The library/museum and cultural heritage collections in this scene
> often use SKOS, which is an RDF vocabulary for representing topics
> (thesaurus-like stuff). So you get some interesting structure there,
> and often a database that is a set of records which are tagged with
> one or more SKOS URIs. So again, not a classic recommendation dataset,
> but a lot still worth digging into.
>
> Bibliographic data: http://ckan.net/group/bibliographic
>
> Nearby: draft report from W3C Linked Library incubator group,
> http://lists.w3.org/Archives/Public/public-lld/2011Jun/0084.html ...
> including lengthy 'vocabularies and datasets' report,
> http://www.w3.org/2005/Incubator/lld/wiki/Vocabulary_and_Dataset
>
> For more on SKOS, see http://www.w3.org/2004/02/skos/
>
> British Library's "national bibliography" at
> http://www.bl.uk/bibliographic/datasamples.html
> http://ckan.net/package/jiscopenbib-bl_bnb-1
> http://openbiblio.net/2010/11/22/querying-the-british-national-bibliography/
>
> ...this gives you about 3 million records. Some of that data surfaces
> in http://bibliographica.org/ ... and nearby there are things like
> http://openlibrary.org/ not to mention DBpedia.org, Freebase.com and
> similar.
>
> So while there might be a shortage of classic "users x items"
> entertainment-content recommender datasets, there are many, many other
> interesting collections being released as open data on a weekly basis.
>
> cheers,
>
> Dan
>
>
> ps. many Twitter crawls have vanished, but take a close look at the
> HTML for http://snap.stanford.edu/data/twitter7.html ...
>



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
