Hi,
        I hate to resurrect an old thread, but for the sake of completeness I 
would like to post my experience with importing the XML dumps of 
Wikipedia into MediaWiki, so that it may help someone else looking for 
this information. I started this thread, after all.

        I was attempting to import the XML/SQL dumps of the English Wikipedia 
from http://download.wikimedia.org/enwiki/20081008/ (not the most recent 
version) using the three methods described at 
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps

I.      Using importDump.php:
While this is the recommended method, I ran into memory issues. The PHP 
CLI runs out of memory after a day or two, and then you have to restart 
the import. (The good thing is that after the restart it skips quickly 
over pages it has already imported.) However, it crashed too many times, 
so I gave up on it.
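
For reference, the sort of invocation this method expects is along these 
lines (a sketch only: it assumes you run it from the MediaWiki root with 
the dump already decompressed, and the -d memory_limit override merely 
postpones the out-of-memory problem rather than fixing it):

# Import the XML dump, with the PHP CLI memory limit raised.
$ php -d memory_limit=512M maintenance/importDump.php < enwiki-20081008-pages-articles.xml
# Optionally rebuild the recentchanges table afterwards.
$ php maintenance/rebuildrecentchanges.php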

II.     Using mwdumper:
This is actually pretty fast, and does not give errors. However, I could 
not figure out why it imports only 6.1 million pages, compared to the 
7.6 million pages in the dump mentioned above (not the most recent 
dump). The command line output correctly indicates that 7.6 million 
pages have been processed, but when you count the entries in the page 
table, only 6.1 million show up. I don't know what happens to the rest, 
because as far as I can see there were no errors.
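
In case it is useful, mwdumper is typically invoked along these lines (a 
sketch, assuming mwdumper.jar is in the current directory, the page, 
revision and text tables are empty, and your database is called wikidb):

# Convert the XML dump to SQL and pipe it straight into MySQL.
$ java -jar mwdumper.jar --format=sql:1.5 enwiki-20081008-pages-articles.xml \
    | mysql -u root -p --default-character-set=utf8 wikidb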

III.    Using xml2sql:
Actually this is not the recommended way of importing the XML dumps, 
according to http://meta.wikimedia.org/wiki/Xml2sql, but it is the only 
way that really worked for me. However, compared to the other tools, it 
needs to be compiled and installed before it will work. As Joshua 
suggested, a simple:
$ xml2sql enwiki-20081008-pages-articles.xml
$ mysqlimport -u root -p --local wikidb ./{page,revision,text}.txt

worked for me.

Notes: Your local MediaWiki will still not look like the online wiki 
(even after you take into account that images do not come with these 
dumps).
1.      To get closer, I first imported the SQL dumps into the other tables 
that were available at http://download.wikimedia.org/enwiki/20081008/ 
(except page, since you have already imported it by now); see the 
example after these notes.
2.      I next installed the extensions listed in the “Parser hooks” section 
under “Installed extensions” on 
http://en.wikipedia.org/wiki/Special:Version (again, see the sketch 
below).
3.      Finally, I recommend that you use HTML Tidy, because even after the 
above steps the output is still mangled. The HTML Tidy settings go in 
LocalSettings.php; they are not there by default, so you need to copy 
them from includes/DefaultSettings.php. The settings that worked for me 
were:
$wgUseTidy = true;
$wgAlwaysUseTidy = false;
$wgTidyBin = '/usr/bin/tidy';
$wgTidyConf = $IP.'/includes/tidy.conf';
$wgTidyOpts = '';
$wgTidyInternal = extension_loaded( 'tidy' );

And

$wgValidateAllHtml = false;

Ensure this last one is false, otherwise you will get nothing for most 
of the pages.
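
For note 1 above, the supplemental SQL dumps can be loaded with 
something along these lines (a sketch; categorylinks is just one 
example, the exact file names depend on which tables you download):

# Decompress a table dump and feed it straight into the wiki database.
$ gunzip -c enwiki-20081008-categorylinks.sql.gz | mysql -u root -p wikidb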
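
For note 2, each extension is enabled by unpacking it under extensions/ 
and adding a line to LocalSettings.php. For example (ParserFunctions and 
Cite are just two of the parser hooks listed on Special:Version; the 
full list you need may differ):

// In LocalSettings.php:
require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );
require_once( "$IP/extensions/Cite/Cite.php" );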

I hope the above information helps others who also want to import the 
XML dumps of Wikipedia into MediaWiki.

Thanks to all who answered my posts,
O. O.


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
