Hi Rami,
If I recall correctly, we use the diff library from Google
(http://code.google.com/p/google-diff-match-patch/),
and the total size is about 420 GB (after decompression).
But you can also just download a couple of chunks and see if you can handle
those.
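For question 1, here is a minimal sketch of how the added/removed content between
two revision texts can be extracted with the Python port of diff-match-patch. This
is just an illustration, not the actual Wikihadoop code; revision_diff and the
sample strings are made up for the example.

    # Sketch only: extract inserted/deleted text between two revisions
    # using the Python port of google-diff-match-patch.
    from diff_match_patch import diff_match_patch

    def revision_diff(old_rev, new_rev):
        dmp = diff_match_patch()
        diffs = dmp.diff_main(old_rev, new_rev)
        dmp.diff_cleanupSemantic(diffs)  # merge tiny edits into readable chunks
        added = [text for op, text in diffs if op == dmp.DIFF_INSERT]
        removed = [text for op, text in diffs if op == dmp.DIFF_DELETE]
        return added, removed

    if __name__ == "__main__":
        added, removed = revision_diff("The cat sat.", "The big cat sat down.")
        print("added:", added)      # text chunks present only in the new revision
        print("removed:", removed)  # text chunks present only in the old revision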
Best,
Diederik
On Fri, Nov 4, 2011, Rami wrote:
Hi Diederik,
I have two questions:
1. Which algorithm did you use to get the added/removed content between two
revisions of a Wikipedia article?
2. What is the size of the diffdb dump after extraction? I do not want
to waste Wikipedia's bandwidth if I already know that I cannot handle it ;).
By the way
Dear Wiki Researchers,
During the summer we worked on Wikihadoop [0], a tool that allows us
to create diffs between two revisions of a wiki article using Hadoop.
Now I am happy to announce that the entire diffdb is available for download
at http://dumps.wikimedia.org/other/diffdb/
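If you want to gauge the bandwidth cost before committing to a full download, a
rough sketch along these lines could ask the server for a chunk's size without
fetching it. This assumes Python 3; CHUNK_NAME is a hypothetical filename, so
check the directory listing at the URL above for the real chunk names.

    # Sketch only: issue a HEAD request to read Content-Length for one chunk.
    from urllib.request import Request, urlopen

    BASE_URL = "http://dumps.wikimedia.org/other/diffdb/"
    CHUNK_NAME = "diffdb-part-0001.gz"  # hypothetical example filename

    req = Request(BASE_URL + CHUNK_NAME, method="HEAD")
    with urlopen(req) as resp:
        size = int(resp.headers.get("Content-Length", 0))
    print(f"{CHUNK_NAME}: {size / 1024**3:.2f} GB compressed")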