I ran into this problem recently. A Python script is available at https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Offline/mwimport.py that converts .xml.bz2 dumps into flat fast-import files, which can be loaded into most databases. Sorry, this tool is still alpha quality.
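To give a feel for the approach, here is a simplified, untested sketch - NOT mwimport.py itself, and the export-0.6 namespace URI and element layout are assumptions you should check against the <mediawiki> root element of your own dump. It streams the .xml.bz2 with an incremental parser and writes one tab-separated title/text row per page, which a bulk loader such as MySQL's LOAD DATA INFILE or PostgreSQL's COPY can then ingest:

# Simplified sketch of the streaming idea, NOT mwimport.py itself.
# Assumption: the dump uses the export-0.6 schema; check the xmlns on
# your dump's <mediawiki> root element and adjust NS if it differs.
import bz2
import sys
import xml.etree.ElementTree as ET

NS = '{http://www.mediawiki.org/xml/export-0.6/}'

def pages(path):
    # Incremental parse: never holds the whole multi-GB tree in memory.
    with bz2.BZ2File(path) as f:
        for event, elem in ET.iterparse(f):
            if elem.tag == NS + 'page':
                title = elem.findtext(NS + 'title', '')
                text = elem.findtext(NS + 'revision/' + NS + 'text', '') or ''
                yield title, text
                elem.clear()  # drop the subtree we just consumed

if __name__ == '__main__':
    for title, text in pages(sys.argv[1]):
        # One row per line: escape backslashes first, then the separator
        # and newlines, so the bulk loader sees exactly two columns.
        text = text.replace('\\', '\\\\').replace('\t', '\\t').replace('\n', '\\n')
        sys.stdout.write('%s\t%s\n' % (title, text))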
Feel free to contact me with problems.

-Adam Wight

j...@sahnwaldt.de:
> mwdumper seems to work for recent dumps:
> http://lists.wikimedia.org/pipermail/mediawiki-l/2012-May/039347.html
>
> On Tue, Jun 12, 2012 at 11:19 PM, Steve Bennett <stevag...@gmail.com> wrote:
> > Hi all,
> > I've been tasked with setting up a local copy of the English
> > Wikipedia for researchers - sort of like another Toolserver. I'm not
> > having much luck and wondered whether anyone has done this recently,
> > and what approach they used. We only really need the current article
> > text - history and meta pages aren't needed.
> >
> > Things I have tried:
> >
> > 1) Downloading and mounting the SQL dumps
> >
> > No good because they don't contain article text.
> >
> > 2) Downloading and mounting other SQL "research dumps" (e.g.
> > ftp://ftp.rediris.es/mirror/WKP_research)
> >
> > No good because they're years out of date.
> >
> > 3) Using WikiXRay on the enwiki-latest-pages-meta-history?.xml-.....xml
> > files
> >
> > No good because they decompress to an astronomically large size. I got
> > about halfway through decompressing them and was already over 7 TB.
> >
> > Also, WikiXRay appears to be old and out of date (although,
> > interestingly, its author Felipe Ortega committed to the gitorious
> > repository [1] on Monday for the first time in over a year).
> >
> > 4) Using MWDumper (http://www.mediawiki.org/wiki/Manual:MWDumper)
> >
> > No good because it's old and out of date: it only supports export
> > version 0.3, and the current dumps are 0.6.
> >
> > 5) Using importDump.php on a latest-pages-articles.xml dump [2]
> >
> > No good because it just spews out 7.6 GB of this output:
> >
> > PHP Warning: xml_parse(): Unable to call handler in_() in
> > /usr/share/mediawiki/includes/Import.php on line 437
> > PHP Warning: xml_parse(): Unable to call handler out_() in
> > /usr/share/mediawiki/includes/Import.php on line 437
> > PHP Warning: xml_parse(): Unable to call handler in_() in
> > /usr/share/mediawiki/includes/Import.php on line 437
> > PHP Warning: xml_parse(): Unable to call handler in_() in
> > /usr/share/mediawiki/includes/Import.php on line 437
> > ...
> >
> > So, any suggestions for approaches that might work? Or suggestions for
> > fixing the errors in step 5?
> >
> > Steve
> >
> > [1] http://gitorious.org/wikixray
> > [2] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
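On the step 4/5 problems: I can't say for certain what triggers the in_()/out_() handler warnings, but one obvious mismatch is the export schema version - the importers above predate 0.6, and current dumps declare 0.6. A crude, untested workaround sketch (an assumption on my part, not a documented fix) is to rewrite the version declared on the root <mediawiki> element before handing the stream to the importer; the script name downgrade.py is just a placeholder:

# Hypothetical, untested workaround: declare an older export schema
# version on the root <mediawiki> element so an old importer will
# accept the stream. Only the root element's open tag is rewritten;
# everything after it passes through untouched.
import bz2
import sys

seen_root = False   # have we reached the <mediawiki ...> open tag?
done = False        # has that open tag finished (we saw its '>')?
with bz2.BZ2File(sys.argv[1]) as f:
    for raw in f:
        line = raw.decode('utf-8')
        if not done:
            line = line.replace('export-0.6', 'export-0.3')
            line = line.replace('version="0.6"', 'version="0.3"')
            if '<mediawiki' in line:
                seen_root = True
            if seen_root and '>' in line:
                done = True
        sys.stdout.write(line)

You could then try piping the output into the importer, e.g. python downgrade.py enwiki-latest-pages-articles.xml.bz2 | php maintenance/importDump.php. Be warned that elements introduced after schema 0.3 may still confuse old import code, so treat this as a last resort.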