Ok, my 'reply all' is failing me in this mail user agent. Anyways, third
time's a charm...

--- Begin Message ---
Whoops, forgot to send this to the list.  Also forgot to add the footnote,
so I'm doing that now.

A.

On Tue, 26-03-2013 at 09:18 +0200, Ariel T. Glenn wrote:
> On Mon, 25-03-2013 at 23:36 -0700, Randall Farmer wrote:
> > This isn't exactly what you're looking for, but I've been playing
> > around on my own time with how to keep a dump that's compressed but
> > also allows some random access. Last weekend I ended up writing the
> > attached script, which takes an XML file and makes a simple gzipped,
> > indexed, sort-of-random-access dump:
> > 
> > 
> > - Each article is individually gzipped, then the files are
> > concatenated.
> > - gunzip -c [files] will still stream every page if your tools like
> > that.
> > - I split the dump into 8 files, matching the core count of the EC2
> > instance running the job.
> > - It generated a text index (title, redirect dest., gzip file number,
> > offset, length) you could load into memory or a database.
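> 
> Just to check that I follow the layout: with an index like that,
> fetching a single page is a seek, a read of 'length' bytes and a gunzip
> of just that member.  Roughly (untested, and I'm guessing at the field
> separator and the file naming):
> 
>     import gzip
> 
>     def get_article(index_path, dump_prefix, title):
>         # assumed index layout: title, redirect dest., gzip file
>         # number, offset, length, tab-separated, one line per page
>         with open(index_path, encoding='utf-8') as index:
>             for line in index:
>                 fields = line.rstrip('\n').split('\t')
>                 title_f, redirect, filenum, offset, length = fields
>                 if title_f != title:
>                     continue
>                 fname = '%s-%s.gz' % (dump_prefix, filenum)  # made-up naming
>                 with open(fname, 'rb') as dump:
>                     dump.seek(int(offset))
>                     blob = dump.read(int(length))
>                 # each page is its own gzip member, so this is a clean stream
>                 return gzip.decompress(blob).decode('utf-8')
>         return None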
> > 
> > 
> > 
> > It took about 90 minutes for the gzipping/indexing, and the result was
> > about 20 GB for enwiki. I used gzip compression level 1, because I was
> > impatient. :) 
> > 
> > 
> > I can share an EC2 disk snapshot with the actual dump reformatted this
> > way, if that's at all interesting to you.
> 
> This was the idea behind the bz2 multistream dumps and the associated
> python scripts for poking around in them [1].  I made a different choice
> in the tradeoff between space and performance, compressing 100 pages
> together.
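> 
> For reference, the index there has one line per page of the form
> offset:pageid:title, where the offset points at the start of the bz2
> stream holding that page and its neighbours.  Pulling one stream back
> out is roughly this (untested):
> 
>     import bz2
> 
>     def get_stream(multistream_path, index_path, title):
>         offsets = []
>         wanted = None
>         with open(index_path, encoding='utf-8') as index:
>             for line in index:
>                 offset, pageid, t = line.rstrip('\n').split(':', 2)
>                 offsets.append(int(offset))
>                 if t == title:
>                     wanted = int(offset)
>         if wanted is None:
>             return None
>         following = [o for o in offsets if o > wanted]
>         with open(multistream_path, 'rb') as dump:
>             dump.seek(wanted)
>             size = min(following) - wanted if following else -1
>             data = dump.read(size)
>         # one complete bz2 stream holding the 100 pages
>         return bz2.decompress(data).decode('utf-8')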
> 
> It would be nice to have a place we could put community-generated data
> (new formats, subsets of the data, etc) for other folks to re-use. Maybe
> we could convince a mirror site to host such things.  (Any takers?)
> 
> > I haven't written anything that tries to update, but:
> > 
> > 
> > - You could replace an article by just appending the compressed
> > version somewhere and updating the index.
> >   (For convenience, you might want to write a list of
> > 'holes' (byte-ranges for replaced articles) somewhere.)
> 
> I assume we need indexing for fast retrieval.
> 
> > - You could truly delete an old revision by writing zeroes over it, if
> > that's a concern.
> 
> Yes, we would have to actually delete content (including the title and
> associated information, if the page has been deleted).  This means that
> we won't have anything like 'undeletion'; we'll just be adding seemingly
> new page contents if a page is restored.
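> 
> Putting your two points together, an update (or a real deletion) on a
> single-file version of this would look something like the sketch below;
> 'index' and 'holes' are stand-ins for whatever we actually keep on disk:
> 
>     import gzip
> 
>     def replace_page(dump_path, index, holes, title, new_text):
>         # index: title -> (offset, length)
>         # holes: list of dead (offset, length) byte ranges
>         blob = gzip.compress(new_text.encode('utf-8'))
>         with open(dump_path, 'r+b') as dump:
>             if title in index:
>                 old_offset, old_length = index[title]
>                 dump.seek(old_offset)
>                 # really gone, not just unreferenced
>                 dump.write(b'\x00' * old_length)
>                 holes.append((old_offset, old_length))
>             dump.seek(0, 2)          # append the new version at the end
>             new_offset = dump.tell()
>             dump.write(blob)
>         index[title] = (new_offset, len(blob))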
> 
> > - Once you do that, streaming all the XML requires a script smart
> > enough to skip the holes.
> 
> Yes, we need a script that will not only skip the holes but also write
> out the pages and revisions 'in canonical order', which, if free blocks
> are reused, might differ considerably from the order in which they are
> stored in this new format.
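> 
> That is, if the reader walks the index in page (and revision) id order
> and only ever follows live entries, the holes drop out by themselves;
> the price is seeking all over the file instead of reading it front to
> back.  Something like:
> 
>     import gzip
> 
>     def stream_canonical(dump_path, index):
>         # index: page_id -> (offset, length), live entries only, so
>         # holes and replaced versions are never visited
>         with open(dump_path, 'rb') as dump:
>             for page_id in sorted(index):
>                 offset, length = index[page_id]
>                 dump.seek(offset)
>                 blob = dump.read(length)
>                 yield gzip.decompress(blob).decode('utf-8')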
> 
> If we compress multiple items (revision texts) together in order not to
> completely lose out on disk space, consider this scenario: a bot goes
> through and edits all articles in Category X, moving them to Category Y.
> If we compress all revisions of a page together, we are going to do a
> lot of uncompression in order to fold those changes in.  This argues for
> compressing multiple revisions together ordered simply by revision id,
> to minimize the uncompression and recompression of unaltered material
> during the update.  But this needs more thought.
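> 
> Roughly what I have in mind, with the block size pulled out of the air:
> 
>     import bz2
> 
>     CHUNK_SIZE = 100    # revisions per compressed block, pure guesswork
> 
>     def append_revisions(dump, chunk_index, new_revisions):
>         # dump: dump file opened 'r+b'
>         # chunk_index: rev_id -> (offset, length)
>         # new_revisions: list of (rev_id, xml_text) sorted by rev_id
>         # Grouping strictly by ascending rev id means a bot run that
>         # touches half the wiki still only writes new blocks; nothing
>         # already on disk gets recompressed.
>         dump.seek(0, 2)              # new blocks always go at the end
>         for start in range(0, len(new_revisions), CHUNK_SIZE):
>             chunk = new_revisions[start:start + CHUNK_SIZE]
>             texts = ''.join(text for _, text in chunk)
>             blob = bz2.compress(texts.encode('utf-8'))
>             offset = dump.tell()
>             dump.write(blob)
>             for rev_id, _ in chunk:
>                 chunk_index[rev_id] = (offset, len(blob))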
> 
> > - Once you have that script, you could use it to produce a
> > 'cleaned' (i.e., holeless) copy of the dump periodically.
> > 
> > Like you say, better formats are possible: you could compress articles
> > in batches, reuse holes for new blocks, etc. Updating an article
> > becomes a less trivial operation when you do all of that. The point
> > here is just that a relatively simple format can do a few of the key
> > things you want.
> > 
> > 
> > 
> > FWIW, I think the really interesting part of your proposal might be
> > how to package daily changes--filling the gaps in adds/changes dumps.
> > Working on the parsing/indexing/compressing script, I sort of wrestled
> > with whether what I was doing was substantially better than just
> > loading the articles into a database as blobs. I'm still not sure. But
> > more complete daily patch info is useful no matter how the data is
> > stored. 
> 
> Heh yes, I was hoping that we could make use of the adds/changes
> material somehow to get this going :-)
> 
> > 
> > On the other hand, potentially in favor of making a dynamic dump
> > format, it could be a platform for your "reference implementation" of
> > an incremental update script. That is, Wikimedia publishes some
> > scripts that will apply updates to a dump in a particular format, in
> > hopes that random users can adapt them from updating a dump to
> > updating a MySQL database, MongoDB cluster, or whatever other
> > datastore they use.
> > 
> 
> This is harder because, while we could possibly generate insert
> statements for the new rows in the page, revision and text tables, and
> maybe even a series of delete statements, we can't do much for the
> other tables like pagelinks, categorylinks and so on.  I haven't even
> tried to get my head around that problem; that's for further down the
> road.
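> 
> For the page/revision/text part it is at least mechanical: walk the
> incremental XML and emit rows, something like the fragment below
> (column list abridged and from memory, so don't trust it):
> 
>     def revision_insert(rev):
>         # rev: dict built from one <revision> element in the XML
>         return ("INSERT INTO revision (rev_id, rev_page, rev_text_id, "
>                 "rev_timestamp, rev_len) VALUES (%d, %d, %d, '%s', %d);"
>                 % (rev['id'], rev['page_id'], rev['text_id'],
>                    rev['timestamp'], rev['len']))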
> 
> > Anyway, sorry for the scattered thoughts and hope some of this is
> > useful or at least thought-provoking.
> > 
> > 
> > Randall

Keep those scattered thoughts coming!

Ariel

[1]
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000606.html

> > On Mon, Mar 25, 2013 at 4:22 AM, Ariel T. Glenn <ar...@wikimedia.org>
> > wrote:
> >         So I was thinking about things I can't undertake, and one of
> >         those
> >         things is the 'dumps 2.0' which has been rolling around in the
> >         back of
> >         my mind.  The TL;DR version is: sparse compressed archive
> >         format that
> >         allows folks to add/subtract changes to it random-access
> >         (including
> >         during generation).
> >         
> >         See here:
> >         
> >         
> > https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_dumps
> >         
> >         What do folks think? Workable? Nuts? Low priority? Interested?
> >         
> >         Ariel



--- End Message ---
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
