https://bugzilla.wikimedia.org/show_bug.cgi?id=27113

           Summary: be able to restart history dump after breakage, from
                    where it was interrupted
           Product: XML Snapshots
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: Normal
         Component: General
        AssignedTo: ar...@wikimedia.org
        ReportedBy: ar...@wikimedia.org
                CC: tf...@wikimedia.org
            Blocks: 27110


Dumping the page-meta-history file is the phase that takes the longest; when
some external factor causes the dumps to fail (a code push that breaks them,
network/db/power/space/other issues), they currently must be restarted from the
beginning.  Even when they complete in 2 weeks instead of 6 weeks, the odds of
something going wrong in that time are quite high.  Being able to restart from
the point of interruption would mean being able to produce them on a reasonable
schedule.

Code available: find the last page id in the file from an interrupted run
(works only for bz2 files), by seeking to the end and walking through the
compressed blocks.

Code needed: stream this file to a filter which writes out the MediaWiki
header, writes everything up to but excluding the last pageID, writes the
MediaWiki footer; this output can be piped to bzip2 to produce an intact bzip2
file.
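
A minimal sketch of such a filter, in Python, not the actual dump code. It
assumes the decompressed dump has <page>, <id> and </mediawiki> tags each on
their own line, and that the first <id> after <page> is the page id. The
function name truncate_before_page is made up for illustration.

```python
import re

# Matches a line consisting only of an <id> element, e.g. "<id>42</id>".
ID_RE = re.compile(r"<id>(\d+)</id>")


def truncate_before_page(lines, stop_page_id, out):
    """Copy the header and all pages before stop_page_id, then the footer.

    lines: iterable of text lines from the (possibly truncated) dump.
    stop_page_id: id of the last, possibly incomplete, page; it is excluded.
    out: writable text stream.
    """
    pending = None  # lines held back since <page>, until the page id is known
    for line in lines:
        stripped = line.strip()
        if pending is not None:
            pending.append(line)
            m = ID_RE.fullmatch(stripped)
            if m:
                if int(m.group(1)) == stop_page_id:
                    break  # drop this page and everything after it
                out.writelines(pending)  # page is wanted: flush and stream on
                pending = None
        elif stripped == "<page>":
            pending = [line]
        elif stripped == "</mediawiki>":
            break  # the footer is written exactly once, below
        else:
            out.write(line)  # header lines and the body of a wanted page
    out.write("</mediawiki>\n")
```

Wrapped in a small script that reads stdin and writes stdout, this could sit
between a decompressor and bzip2 as described above. Only the lines between
<page> and the page id are buffered, so pages with long histories are streamed
rather than held in memory.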

We can then run from that pageID to the end, take the two bzip2 files,
recombine them and be done.
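
One possible way to recombine the two files, sketched in Python. This is an
assumption about how the merge could work, not the existing implementation:
the footer of the first file and the header of the second are dropped so the
result is a single XML document, at the cost of recompressing everything. The
function names are made up, and tags are assumed to sit on their own lines.

```python
import bz2


def join_dump_parts(first, second, out):
    """Write first without its footer, then second without its header."""
    for line in first:
        if line.strip() == "</mediawiki>":
            break  # skip the footer of the first part
        out.write(line)
    in_pages = False
    for line in second:
        # Everything before the first <page> of the second part is header.
        if not in_pages and line.strip() == "<page>":
            in_pages = True
        if in_pages:
            out.write(line)  # includes the footer of the second part


def join_bz2_files(first_path, second_path, out_path):
    """Same as join_dump_parts, reading and writing bzip2 files."""
    with bz2.open(first_path, "rt", encoding="utf-8") as first, \
         bz2.open(second_path, "rt", encoding="utf-8") as second, \
         bz2.open(out_path, "wt", encoding="utf-8") as out:
        join_dump_parts(first, second, out)
```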

Why can't we just find the truncated bzip2 block, toss it, and start from
there? Because bzip2 requires a cumulative CRC at the end of the file, which
means rereading all the text as soon as we want to add blocks at the end.
