I ran into this problem recently.  A Python script is available at
https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Offline/mwimport.py
that converts .xml.bz2 dumps into flat fast-import files, which can be
loaded into most databases.  Sorry, the tool is still alpha quality.
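
For anyone who wants to see the general shape of the approach, here is a
minimal sketch (not mwimport.py itself; the filenames, namespace URI, and
TSV output format below are only my assumptions) of streaming a
pages-articles .xml.bz2 dump and writing one tab-separated row per page,
which a bulk loader such as MySQL's LOAD DATA INFILE or PostgreSQL's COPY
could then ingest:

    import bz2
    import xml.etree.ElementTree as ET

    # Namespace used by export-format 0.6 dumps; adjust for other versions.
    NS = '{http://www.mediawiki.org/xml/export-0.6/}'

    def dump_to_tsv(dump_path, out_path):
        """Stream <page> elements from a .xml.bz2 dump and emit TSV rows."""
        with bz2.BZ2File(dump_path) as src, open(out_path, 'w') as out:
            for event, elem in ET.iterparse(src):
                if elem.tag == NS + 'page':
                    title = elem.findtext(NS + 'title') or ''
                    text = elem.findtext(NS + 'revision/' + NS + 'text') or ''
                    # Escape characters that would break the TSV layout.
                    text = text.replace('\\', '\\\\')
                    text = text.replace('\t', '\\t').replace('\n', '\\n')
                    out.write('%s\t%s\n' % (title.replace('\t', ' '), text))
                    elem.clear()  # keep memory bounded on a multi-GB dump

    if __name__ == '__main__':
        dump_to_tsv('enwiki-latest-pages-articles.xml.bz2', 'pages.tsv')

mwimport.py does more than this (it writes proper fast-import files), but
the stream-parse-then-bulk-load pattern is the same.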

Feel free to contact me with problems.

-Adam Wight

j...@sahnwaldt.de:
> mwdumper seems to work for recent dumps:
> http://lists.wikimedia.org/pipermail/mediawiki-l/2012-May/039347.html
> 
> On Tue, Jun 12, 2012 at 11:19 PM, Steve Bennett <stevag...@gmail.com> wrote:
> > Hi all,
> >  I've been tasked with setting up a local copy of the English
> > Wikipedia for researchers - sort of like another Toolserver. I'm not
> > having much luck, and I wondered whether anyone has done this recently
> > and what approach they used. We only really need the current article text
> > - history and meta pages aren't needed.
> >
> > Things I have tried:
> > 1) Downloading and mounting the SQL dumps
> >
> > No good because they don't contain article text
> >
> > 2) Downloading and mounting other SQL "research dumps" (eg
> > ftp://ftp.rediris.es/mirror/WKP_research)
> >
> > No good because they're years out of date
> >
> > 3) Using WikiXRay on the enwiki-latest-pages-meta-history?.xml-.....xml 
> > files
> >
> > No good because they decompress to an astronomically large size. I got
> > about halfway through decompressing them and was already over 7 TB.
> >
> > Also, WikiXRay appears to be old and out of date (although
> > interestingly its author Felipe Ortega has just committed to the
> > gitorious repository[1] on Monday for the first time in over a year)
> >
> > 4) Using MWDumper (http://www.mediawiki.org/wiki/Manual:MWDumper)
> >
> > No good because it's old and out of date: it only supports export
> > version 0.3, and the current dumps are 0.6
> >
> > 5) Using importDump.php on a latest-pages-articles.xml dump [2]
> >
> > No good because it just spews out 7.6 GB of this output:
> >
> > PHP Warning:  xml_parse(): Unable to call handler in_() in
> > /usr/share/mediawiki/includes/Import.php on line 437
> > PHP Warning:  xml_parse(): Unable to call handler out_() in
> > /usr/share/mediawiki/includes/Import.php on line 437
> > PHP Warning:  xml_parse(): Unable to call handler in_() in
> > /usr/share/mediawiki/includes/Import.php on line 437
> > PHP Warning:  xml_parse(): Unable to call handler in_() in
> > /usr/share/mediawiki/includes/Import.php on line 437
> > ...
> >
> >
> > So, any suggestions for approaches that might work? Or suggestions for
> > fixing the errors in step 5?
> >
> > Steve
> >
> >
> > [1] http://gitorious.org/wikixray
> > [2] 
> > http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
> >

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
