> *From: *Susan Biancani <inacn...@gmail.com>
> *Subject: **[Wiki-research-l] diffdb formatted Wikipedia dump*
> *Date: *October 3, 2013 10:06:44 PM PDT
> *To: *wiki-research-l@lists.wikimedia.org
> *Reply-To: *Research into Wikimedia content and communities <
> wiki-research-l@lists.wikimedia.org>
>
> I'm looking for a dump from English Wikipedia in diff format (i.e. each
> entry is the text that was added/deleted since the last edit, rather than
> each entry is the current state of the page).
>
> The Summer of Research folks provided a handy guide to how to create such
> a dataset from the standard complete dumps here:
> http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
> But the time estimate they give is prohibitive for me (20-24 hours for
> each dump file--there are currently 158--running on 24 cores). I'm a grad
> student in a social science department, and don't have access to extensive
> computing power. I've been paying out of pocket for AWS, but this would get
> expensive.
>
> There is a diff-format dataset available, but only through April, 2011
> (here: http://dumps.wikimedia.org/other/diffdb/). I'd like to get a
> diff-format dataset covering January 2010 through March 2013 (or for
> everything up to March 2013).
>
> Does anyone know if such a dataset exists somewhere? Any leads or
> suggestions would be much appreciated!
>
Hi Susan,

The bad news is that there is no newer version of the dataset than the one
you found. The good news is that the dataset was generated on fairly slow
commodity hardware -- what you could do is run the diff code on AWS against a
smaller wiki, for example the Dutch Wikipedia, and see how long it takes. An
alternative would be to start a conversation (with other researchers and
Wikimedia community members) about setting up a small Hadoop cluster in Labs
holding only public data. That way you wouldn't need to pay, though it would
obviously be less performant. The Analytics team already has Puppet manifests
that will build an entire Hadoop cluster.
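If it helps for estimating runtime on a smaller wiki, the core of the diff
computation can be sketched in a few lines with Python's standard-library
difflib. This is only an illustration of the general technique, not the
actual diffdb code or its output format; the revision texts here are made up.

```python
# Minimal sketch: compute lines added/removed between consecutive
# revisions of a page, using only the standard library. This is NOT
# the diffdb implementation, just the same idea in miniature.
import difflib


def revision_diff(old: str, new: str):
    """Return (added_lines, removed_lines) between two revision texts."""
    added, removed = [], []
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    for line in diff:
        # Skip the "---"/"+++" file headers emitted by unified_diff.
        if line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
    return added, removed


# Hypothetical revision history (oldest to newest) for one page.
revisions = [
    "Amsterdam is a city.",
    "Amsterdam is the capital of the Netherlands.",
]
for old, new in zip(revisions, revisions[1:]):
    added, removed = revision_diff(old, new)
    print("added:", added, "removed:", removed)
```

Timing a loop like this over one full dump file of a small wiki should give
a rough per-revision cost you can extrapolate before committing AWS money.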

The wikimedia-analytics mailing list is a good place for such a conversation,
and if you need more hands-on help with the diffdb, please come to the
wikimedia-analytics IRC channel.

Best,
Diederik
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
