Re: [Wiki-research-l] Announcing availability new dataset diffdb

2011-11-04 Thread Diederik van Liere
Hi Rami,

If I recall correctly, we use the diff library from Google (http://code.google.com/p/google-diff-match-patch/) and the total size is about 420 GB after decompression. But you can also just download a couple of chunks and see if you can handle those.

Best,
Diederik
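
For readers who want a feel for what that library produces, here is a minimal sketch using the Python port of google-diff-match-patch (the diff_match_patch module); the two revision strings are made up for illustration and are not taken from the dataset:

  # Compute added/removed text between two revisions with the
  # Python port of google-diff-match-patch.
  from diff_match_patch import diff_match_patch

  old_rev = "Wikipedia is a free encyclopedia."         # hypothetical revision text
  new_rev = "Wikipedia is a free online encyclopedia."  # hypothetical revision text

  dmp = diff_match_patch()
  diffs = dmp.diff_main(old_rev, new_rev)
  dmp.diff_cleanupSemantic(diffs)  # merge character-level edits into readable chunks

  for op, text in diffs:
      if op == dmp.DIFF_INSERT:
          print("added:  ", repr(text))
      elif op == dmp.DIFF_DELETE:
          print("removed:", repr(text))

Running this prints the inserted word "online " as added content; whether the dataset itself stores diffs at word or character granularity is not stated in this thread.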

Re: [Wiki-research-l] Announcing availability new dataset diffdb

2011-11-04 Thread Rami Al-Rfou'
Hi Diederik,

I have two questions:

1. Which algorithm did you use to get the added/removed content between two revisions of Wikipedia?
2. What is the size of the diffdb dump after extracting? I do not want to waste Wikipedia's bandwidth if I know that I cannot deal with it ;).

By the way

[Wiki-research-l] Announcing availability new dataset diffdb

2011-11-04 Thread Diederik van Liere
Dear Wiki Researchers,

During the summer we worked on Wikihadoop [0], a tool that allows us to create the diffs between two revisions of a Wiki article using Hadoop. Now I am happy to announce that the entire diffdb is available for download at http://dumps.wikimedia.org/other/diffdb/
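
For anyone who wants to try a single chunk before committing to the full download, a rough sketch along these lines should work; note that the chunk filename and the bzip2 compression are assumptions for illustration, so check the directory listing at http://dumps.wikimedia.org/other/diffdb/ for the actual file names and formats:

  # Fetch one chunk of the diffdb dump and peek at the first records.
  # "enwiki-diff-chunk-0001.bz2" is a placeholder name; the real chunk
  # names (and compression format) are listed in the dump directory.
  import bz2
  import urllib.request

  BASE = "http://dumps.wikimedia.org/other/diffdb/"
  CHUNK = "enwiki-diff-chunk-0001.bz2"  # placeholder, not a real filename

  urllib.request.urlretrieve(BASE + CHUNK, CHUNK)

  with bz2.open(CHUNK, mode="rt", encoding="utf-8", errors="replace") as f:
      for i, line in enumerate(f):
          print(line.rstrip())
          if i >= 20:  # inspect only the first few lines before downloading more
              break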