Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Hi, I hate to be resurrecting an old thread, but for the sake of completeness I would like to post my experience with importing the XML dumps of Wikipedia into MediaWiki, so that it can help someone else looking for this information. I started this thread, after all.

I was attempting to import the XML/SQL dumps of the English Wikipedia from http://download.wikimedia.org/enwiki/20081008/ (not the most recent version) using the three methods described at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps

I. Using importDump.php: While this is the recommended method, I ran into memory issues. The PHP CLI runs out of memory after a day or two, and then you have to restart the import. (The good thing is that after the restart it skips quickly over pages it has already imported.) However, the fact that it crashed too many times made me give up on it.

II. Using mwdumper: This is actually pretty fast, and does not give errors. However, I could not figure out why it imports only 6.1 million pages, as compared to the 7.6 million pages in the dump mentioned above (not the most recent dump). The command-line output correctly indicates that 7.6 million pages have been processed, but when you count the entries in the page table, only 6.1 million show up. I don't know what happens to the rest, because as far as I can see there were no errors.

III. Using xml2sql: This is not the recommended way of importing the XML dumps according to http://meta.wikimedia.org/wiki/Xml2sql, but it is the only way that really worked for me. However, as compared to the other tools, it needs to be compiled/installed to get it to work. As Joshua suggested, a simple

$ xml2sql enwiki-20081008-pages-articles.xml
$ mysqlimport -u root -p --local wikidb ./{page,revision,text}.txt

worked for me.

Notes: Your local MediaWiki will still not look like the online wiki (even after you take into account that images do not come with these dumps).

1. For that, I first imported the SQL dumps into the other tables that were available at http://download.wikimedia.org/enwiki/20081008/ (except page, since you have already imported it by now).

2. I next installed the extensions listed in the "Parser hooks" section under "Installed extensions" on http://en.wikipedia.org/wiki/Special:Version

3. Finally, I recommend that you use HTML Tidy, because even after the above steps the output is still messed up. The settings for HTML Tidy go in LocalSettings.php. They are not there by default; you need to copy them from includes/DefaultSettings.php. The settings that worked for me were:

$wgUseTidy = true;
$wgAlwaysUseTidy = false;
$wgTidyBin = '/usr/bin/tidy';
$wgTidyConf = $IP.'/includes/tidy.conf';
$wgTidyOpts = '';
$wgTidyInternal = extension_loaded( 'tidy' );

and

$wgValidateAllHtml = false;

Ensure this last one is false, or else you will get nothing for most of the pages.

I hope the above information helps others who also want to import the XML dumps of Wikipedia into MediaWiki. Thanks to all who answered my posts,
O. O.
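The supplementary SQL dumps mentioned in note 1 can be loaded straight into MySQL without any converter. A minimal sketch, assuming the 20081008 file naming and the wikidb database used above (categorylinks is just one example; repeat for each table dump you need):

$ gunzip -c enwiki-20081008-categorylinks.sql.gz | mysql -u root -p wikidb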
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Mohamed Magdy wrote:
> I don't remember if I already mentioned this: you can split the XML file*
> into smaller pieces and then import them using importDump.php. Use a loop to
> make a file like this and then run it:
>
> #!/bin/bash
> php maintenance/importDump.php /path/pagexml.1
> wait
> php maintenance/importDump.php /path/pagexml.2
> ...
>
> I haven't tried starting many importDump.php processes working on different
> XML files simultaneously - will that work?
>
> * = http://blog.prashanthellina.com/2007/10/17/ways-to-process-and-use-wikipedia-dumps/

Thanks Mohamed. This is a good suggestion, but I am a bit wary of trying it, because if I later have problems, I would not be sure whether it is because I used this script to split the XML files. I understand that the script looks OK, in that it simply splits the XML file at the </page> boundaries, but I don't know enough about how this would affect the final result.

Thanks again,
O. O.
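For what it's worth, a minimal sketch of what such a generated script could look like, assuming the chunks are named /path/pagexml.1 through /path/pagexml.N (the chunk count here is only a placeholder):

#!/bin/bash
# Import each chunk in sequence. importDump.php skips pages that already
# exist, so an interrupted run can be resumed by simply re-running the script.
for i in $(seq 1 20); do
    php maintenance/importDump.php /path/pagexml.$i
done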
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Thanks Joshua. I am intending to try two approaches. The first is to use xml2sql and then fill the rest of the tables with the individual SQL dumps of the tables that are already provided. The second is to use MWDumper and then import the rest of the tables using the SQL dumps already provided, to see if there are any differences.

Joshua C. Lerner wrote:
>> Thanks for making this attempt. Let me know if your rebuildall.php has
>> memory issues.
>
> Seems fine - steady at 2.2% of memory available.
>
>> This is really getting confusing for me – because there are so many ways –
>> all of which guaranteed to work – that work, and the one that is recommended
>> – does not seem to work.
>
> I think you mean all of which are *not* guaranteed to work.
>
>> I would try out your approach too – but it would take time as I only have
>> one computer to spare.
>
> If you want I can just send you a database dump. Either now, or after
> rebuildall.php finishes. Right now it's refreshing the links table, but it
> has only reached page_id 34,100 out of over 2 million pages. It'll be running
> for days.
>
> Joshua

Thanks for posting your experience with rebuildall.php. I think I might be able to live with the bad syntax that I get, if I cannot manage to get this to work.

Thanks again,
O. O.
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Daniel Kinzler wrote:
> That sounds very *very* odd, because page content is imported as-is in both
> cases; it's not processed in any way. The only thing I can imagine is that
> things don't look right if you don't have all the templates imported yet.

Thanks Daniel. Yes, I think that this may be because the templates are not imported (I get a lot of Template:... markup). Any suggestions on how to import the templates? I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the templates – but I did not find a way to install them separately.

Another thing I noticed (with the Portuguese wiki, which is a much smaller dump than the English wiki) is that the number of pages imported by importDump.php and MWDumper differ, i.e. importDump.php had many more pages than MWDumper. That is why I would have preferred to do this using importDump.php.

Also, in a previous post you mentioned taking care of the "secondary link tables". How do I do that? Does "secondary links" refer to language links, external links, template links, image links, category links, page links, or something else?

Thanks for your patience,
O. O.
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
--- On Sun, 8/3/09, O. O. olson...@yahoo.com wrote:
> I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the
> templates – but I did not find a way to install them separately.

No, it only contains a dump of the current version of each article (involving the page, revision and text tables in the DB).

> Another thing I noticed (with the Portuguese wiki, which is a much smaller
> dump than the English wiki) is that the number of pages imported by
> importDump.php and MWDumper differ, i.e. importDump.php had many more pages
> than MWDumper. That is why I would have preferred to do this using
> importDump.php.

On download.wikimedia.org/your_lang_here you can check how many pages were supposed to be included in each dump. There are also other parsers you may want to check (in my experience, my parser was slightly faster than mwdumper): http://meta.wikimedia.org/wiki/WikiXRay_Python_parser

> Also, in a previous post you mentioned taking care of the "secondary link
> tables". How do I do that? Does "secondary links" refer to language links,
> external links, template links, image links, category links, page links, or
> something else?

On the same downloads page you have a list of additional dumps in SQL format (compressed with gzip). I guess you may also want to import them (but of course, you don't need a parser for them; they can be loaded directly into the DB).

Best,
F.
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
O. O. schrieb:
> Daniel Kinzler wrote:
>> That sounds very *very* odd, because page content is imported as-is in both
>> cases; it's not processed in any way. The only thing I can imagine is that
>> things don't look right if you don't have all the templates imported yet.
>
> Thanks Daniel. Yes, I think that this may be because the templates are not
> imported (I get a lot of Template:...). Any suggestions on how to import the
> templates? I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains
> the templates – but I did not find a way to install them separately.

They should be contained. As it says on the download page: Articles, templates, image descriptions, and primary meta-pages.

> Another thing I noticed (with the Portuguese wiki, which is a much smaller
> dump than the English wiki) is that the number of pages imported by
> importDump.php and MWDumper differ, i.e. importDump.php had many more pages
> than MWDumper. That is why I would have preferred to do this using
> importDump.php.

The number of pages should be the same. Sounds to me like the import with mwdumper was simply incomplete. Any error messages?

> Also, in a previous post you mentioned taking care of the "secondary link
> tables". How do I do that? Does "secondary links" refer to language links,
> external links, template links, image links, category links, page links, or
> something else?

This is exactly it. You can rebuild them using the rebuildAll.php maintenance script (or was it refreshAll? something like that). But that takes *very* long to run, and might result in the same memory problem you experienced before. The alternative is to download dumps of these tables and import them into MySQL directly. They are available from the download site.

-- daniel
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Felipe Ortega wrote:
>> I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the
>> templates – but I did not find a way to install them separately.
>
> No, it only contains a dump of the current version of each article (involving
> the page, revision and text tables in the DB).

Thanks Felipe for posting. pages-articles.xml.bz2, as listed at http://download.wikimedia.org/enwiki/20081008/, is described as "Articles, templates, image descriptions, and primary meta-pages." What does "templates" mean if it does not contain the templates?

>> Another thing I noticed (with the Portuguese wiki, which is a much smaller
>> dump than the English wiki) is that the number of pages imported by
>> importDump.php and MWDumper differ, i.e. importDump.php had many more pages
>> than MWDumper. That is why I would have preferred to do this using
>> importDump.php.
>
> On download.wikimedia.org/your_lang_here you can check how many pages were
> supposed to be included in each dump. There are also other parsers you may
> want to check (in my experience, my parser was slightly faster than
> mwdumper): http://meta.wikimedia.org/wiki/WikiXRay_Python_parser

Here my concern is not about speed but about integrity. I don't mind the import taking long, as long as it completes. I used importDump.php because it was listed as the "recommended way" of importing, but now I realize that no one seems to have used it on a real Wikipedia dump. Nonetheless, I will give your tool a try sometime over the next two weeks or so.

>> Also, in a previous post you mentioned taking care of the "secondary link
>> tables". How do I do that? Does "secondary links" refer to language links,
>> external links, template links, image links, category links, page links, or
>> something else?
>
> On the same downloads page you have a list of additional dumps in SQL format
> (compressed with gzip). I guess you may also want to import them (but of
> course, you don't need a parser for them; they can be loaded directly into
> the DB).
>
> Best,
> F.

I have not tried these yet. I will try them tomorrow and get back to you, i.e. the newsgroup.

Thanks again,
O. O.
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Daniel Kinzler wrote:
> O. O. schrieb:
>> I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the
>> templates – but I did not find a way to install them separately.
>
> They should be contained. As it says on the download page: Articles,
> templates, image descriptions, and primary meta-pages.

Thanks Daniel. I know that the templates are contained in pages-articles.xml.bz2. However, as you said that MWDumper may not be importing the templates, my question was how to import them then?

>> Another thing I noticed (with the Portuguese wiki, which is a much smaller
>> dump than the English wiki) is that the number of pages imported by
>> importDump.php and MWDumper differ, i.e. importDump.php had many more pages
>> than MWDumper. That is why I would have preferred to do this using
>> importDump.php.
>
> The number of pages should be the same. Sounds to me like the import with
> mwdumper was simply incomplete. Any error messages?

Actually, I was intending to start a separate thread on this topic, because both MWDumper and importDump.php report that they are skipping certain pages. I did not note down the errors that I received from MWDumper, but the errors from importDump.php look like the one below:

Skipping interwiki page title 'Page_Title'

Anyway, both have the word "Skipping ..." as part of their error. I do not have the actual figures, but I noticed that importDump.php seemed to end up with more pages than MWDumper. (Unfortunately I did not save the output, so I cannot compare how many times I got these errors.)

>> Also, in a previous post you mentioned taking care of the "secondary link
>> tables". How do I do that? Does "secondary links" refer to language links,
>> external links, template links, image links, category links, page links, or
>> something else?
>
> This is exactly it. You can rebuild them using the rebuildAll.php maintenance
> script (or was it refreshAll? something like that). But that takes *very*
> long to run, and might result in the same memory problem you experienced
> before.

Yes, the script is called rebuildall.php and is mentioned at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper – as you mentioned, I am expecting memory problems with this too, since importDump.php is already having memory issues.

> The alternative is to download dumps of these tables and import them into
> MySQL directly. They are available from the download site.
>
> -- daniel

I will try to import the tables tomorrow to see what I get.

Thanks again,
O. O.
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Thanks Joshua. I would prefer that you post to the mailing list / newsgroup, so that all can benefit from your ideas.

--- On Sun, 8-Mar-09, Joshua C. Lerner jler...@gmail.com wrote:
> From: Joshua C. Lerner jler...@gmail.com
> Subject: Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
>
> Just for kicks I decided to try to do an import of ptwiki - using what I
> learned in bringing up mirrors of 4 Greek and 3 English Wikimedia sites,
> including Greek Wikipedia. Basically I had the best luck with Xml2sql
> (http://meta.wikimedia.org/wiki/Xml2sql). The conversion from XML to SQL
> went smoothly:
>
> # ./xml2sql /mnt/pt/ptwiki-20090128-pages-articles.xml
>
> As did the import:
>
> # mysqlimport -u root -p --local pt ./{page,revision,text}.txt
> Enter password:
> pt.page: Records: 1044220  Deleted: 0  Skipped: 0  Warnings: 0
> pt.revision: Records: 1044220  Deleted: 0  Skipped: 0  Warnings: 3
> pt.text: Records: 1044220  Deleted: 0  Skipped: 0  Warnings: 0
>
> I'm running maintenance/rebuildall.php at the moment:
>
> # php rebuildall.php
> ** Rebuilding fulltext search index (if you abort this will break searching;
> run this script again to fix):
> Dropping index...
> Rebuilding index fields for 2119470 pages...
> 442500
>
> (still running)
>
> I'll send a note to the list with the results of this experiment. Let me
> know if you need additional information or help. Are you trying to set up
> any mirrors?
>
> Joshua

Thanks for making this attempt. Let me know if your rebuildall.php has memory issues. This is really getting confusing for me – because there are so many ways – all of which guaranteed to work – that work, and the one that is recommended – does not seem to work. I would try out your approach too – but it would take time as I only have one computer to spare.

Thanks,
O.o.
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Platonides schrieb:
> O. Olson wrote:
>> Does anyone have experience importing the Wikipedia XML dumps into
>> MediaWiki? I made an attempt with the English wiki dump as well as the
>> Portuguese wiki dump, giving php (cli) 1024 MB of memory in the php.ini
>> file. Both of these attempts fail with out-of-memory errors.
>
> Don't use importDump.php for a whole wiki dump, use MWDumper
> http://www.mediawiki.org/wiki/MWDumper

MWDumper doesn't fill the secondary link tables. Please see http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for detailed instructions and considerations.

Also keep in mind that the English Wikipedia is *huge*. You will need a decent database server to be able to process it. I wouldn't even try on a desktop/laptop.

-- daniel
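For context, the usual MWDumper invocation pipes its SQL output straight into MySQL; a sketch, assuming the jar is named mwdumper.jar and the target database is called wikidb (adjust the names to your setup):

$ java -jar mwdumper.jar --format=sql:1.5 enwiki-20081008-pages-articles.xml.bz2 | mysql -u root -p wikidb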
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Platonides wrote:
> Don't use importDump.php for a whole wiki dump, use MWDumper
> http://www.mediawiki.org/wiki/MWDumper

Thanks Platonides. I am just curious why http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_importDump.php says that importDump.php is the recommended method for imports. To be fair, that page does warn that the import of large dumps will be slow. My concern here is not the slowness but the fact that the import crashes with an out-of-memory error. I can give PHP more memory, but the usage just seems to grow over time.

Is this the correct place to ask such questions? Or are there better places?

O. O.
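As a side note, the CLI memory ceiling can also be raised for a single run instead of editing php.ini; a sketch (the dump filename is a placeholder, and -1 removes the limit entirely):

$ php -d memory_limit=-1 maintenance/importDump.php enwiki-20081008-pages-articles.xml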
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Daniel Kinzler wrote:
> MWDumper doesn't fill the secondary link tables. Please see
> http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for detailed
> instructions and considerations. Also keep in mind that the English Wikipedia
> is *huge*. You will need a decent database server to be able to process it.
> I wouldn't even try on a desktop/laptop.
>
> -- daniel

Thanks Daniel. I have tried MWDumper, and the results seem different from importDump.php, i.e. the formatting is messed up. While tracking down what I might be doing wrong, I would prefer to do this using the native method.

Secondly, my question here is about PHP, not about the database. I don't see how a memory leak in PHP can be caused by the database.

Has anyone had practical experience with importDump.php? Did you face any memory issues?

Thanks,
O. O.
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Is this on MW older than 1.14? You may want to disable profiling if it is on.

-Aaron

--------------------------------------------------
From: O. O. olson...@yahoo.com
Sent: Saturday, March 07, 2009 10:28 PM
To: wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

Platonides wrote:
> Don't use importDump.php for a whole wiki dump, use MWDumper
> http://www.mediawiki.org/wiki/MWDumper

Thanks Platonides. I am just curious why http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_importDump.php says that importDump.php is the recommended method for imports. To be fair, that page does warn that the import of large dumps will be slow. My concern here is not the slowness but the fact that the import crashes with an out-of-memory error. I can give PHP more memory, but the usage just seems to grow over time.

Is this the correct place to ask such questions? Or are there better places?

O. O.
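For reference, on MediaWiki versions before 1.14 profiling was typically switched on and off through StartProfiler.php in the installation root; the following is only a sketch of what disabling it there might look like (the file layout is an assumption and may differ between versions):

<?php
# StartProfiler.php - load the stub profiler so that no profiling data
# is collected or kept in memory during long-running maintenance scripts
require_once( dirname( __FILE__ ) . '/includes/ProfilerStub.php' );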
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Jason Schulz wrote:
> Is this on MW older than 1.14? You may want to disable profiling if it is on.
>
> -Aaron

Thanks Jason/Aaron. No, this is the recent MW 1.14, downloaded at the beginning of this week from http://www.mediawiki.org/wiki/Download.
[Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Hi, I am not sure if this is the correct place to ask this – if not, then please let me know the best place for such a question.

Does anyone have experience importing the Wikipedia XML dumps into MediaWiki? I made an attempt with the English wiki dump as well as the Portuguese wiki dump, giving php (cli) 1024 MB of memory in the php.ini file. Both of these attempts fail with out-of-memory errors.

I am using the latest version of MediaWiki, 1.14.0, and PHP 5.2.6-1+lenny2 with Suhosin-Patch 0.9.6.2 (cli) (built: Jan 26 2009 22:41:04). Does anyone have experience with this import and how to avoid the memory errors? I can give it more memory, but it seems to be leaking memory over time.

Thanks again,
O. O.
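One thing worth checking up front is that the command-line PHP often reads a different php.ini than the web server, so the limit you set may not be the one the import actually sees. A quick check (assuming a standard PHP CLI install):

$ php -i | grep memory_limit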