Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-23 Thread O. O.
Hi,
I hate to resurrect an old thread, but for the sake of completeness I 
would like to post my experience importing the XML dumps of Wikipedia 
into MediaWiki, so that it may help someone else looking for this 
information. I started this thread, after all.

I was attempting to import the XML/SQL dumps of the English Wikipedia 
http://download.wikimedia.org/enwiki/20081008/ (not the most recent 
version) using the three methods described at 
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps

I.  Using importDump.php:
While this is the recommended method, I ran into memory issues. The PHP 
(CLI) process runs out of memory after a day or two, and then you have 
to restart the import. (The good thing is that after a restart it 
quickly skips over pages it has already imported.) However, the fact 
that it crashed so many times made me give up on it.
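(For reference, the invocation I used was roughly the following; the 
memory value is only an example and you may need more or less on your 
machine:

$ php -d memory_limit=2048M maintenance/importDump.php < enwiki-20081008-pages-articles.xml

Even with a raised limit, the memory usage kept growing over time, as 
described above.)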

II. Using mwdumper:
This is actually pretty fast and does not give errors. However, I could 
not figure out why it imports only 6.1 million pages, as compared to 
the 7.6 million pages in the dump mentioned above (not the most recent 
dump). The command line output correctly indicates that 7.6 million 
pages have been processed – but when you count the entries in the page 
table, only 6.1 million show up. I don’t know what happens to the rest, 
because as far as I can see there were no errors.
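(The mwdumper command I used was along these lines – the jar file name 
may differ depending on where you got your build:

$ java -jar mwdumper.jar --format=sql:1.5 enwiki-20081008-pages-articles.xml.bz2 | mysql -u root -p wikidb

mwdumper reads the .bz2 file directly and writes SQL for the page, 
revision and text tables.)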

III.Using xml2sql:
Actually this is not the recommended way of importing the XML dumps 
according to http://meta.wikimedia.org/wiki/Xml2sql - but it is the only 
way that really worked for me. However, unlike the other tools, it 
needs to be compiled and installed first. As Joshua suggested, a simple:
$ xml2sql enwiki-20081008-pages-articles.xml
$ mysqlimport -u root -p --local wikidb ./{page,revision,text}.txt

worked for me.

Notes: Your local MediaWiki will still not look like the online wiki 
(even after you take into account that images do not come with these 
dumps).
1.  To address that, I first imported the SQL dumps into the other 
tables that were available at 
http://download.wikimedia.org/enwiki/20081008/ (except page – since you 
have already imported it by now). See the example command after these 
notes.
2.  I next installed the extensions listed in the “Parser hooks” section 
under “Installed extensions” on 
http://en.wikipedia.org/wiki/Special:Version
3.  Finally, I recommend that you use HTML Tidy, because even after the 
above steps the output is still messed up. The settings for HTML Tidy 
go in LocalSettings.php; they are not there by default, so you need to 
copy them from includes/DefaultSettings.php. The settings that worked 
for me were:
$wgUseTidy = true;
$wgAlwaysUseTidy = false;
$wgTidyBin = '/usr/bin/tidy';
$wgTidyConf = $IP.'/includes/tidy.conf';
$wgTidyOpts = '';
$wgTidyInternal = extension_loaded( 'tidy' );

And

$wgValidateAllHtml = false;

Ensure this last one is false; otherwise you will get nothing for most 
of the pages.
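Regarding the example command mentioned in note 1: each of the extra 
SQL dumps can be loaded straight into MySQL, for instance:

$ gunzip -c enwiki-20081008-categorylinks.sql.gz | mysql -u root -p wikidb

and likewise for the other tables such as pagelinks, templatelinks, 
imagelinks, externallinks and langlinks.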

I hope the above information helps others who also want to import XML 
dumps of Wikipedia into MediaWiki.

Thanks to all who answered my posts,
O. O.




Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-13 Thread O. O.
Mohamed Magdy wrote:
 I don't remember if I already mentioned this: you can split the xml
 file * into smaller pieces then import it using importDump.php.
 
 Use a loop to make a file like this and then run it:
 #!/bin/bash
 php maintenance/importDump.php < /path/pagexml.1
 wait
 php maintenance/importDump.php < /path/pagexml.2
 ...
 
 I haven't tried to start many php importDump.php processes working on
 different xml files simultaneously, will it work?
 
 * = 
 http://blog.prashanthellina.com/2007/10/17/ways-to-process-and-use-wikipedia-dumps/

Thanks Mohamed – this is a good suggestion, but I am a bit wary of 
trying it, because if I later have problems, I would not be sure 
whether they were caused by using this script to split the XML files.

I understand that the script looks OK, in that it simply splits the XML 
file at the “</page>” boundaries – but I don’t know enough about how 
this would affect the final result.
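If I do try it, I would probably generate the batch file with a small 
loop rather than by hand – something like the following (the paths are 
just placeholders):

for f in /path/pagexml.*; do
    echo "php maintenance/importDump.php < $f"
    echo "wait"
done > import-all.sh

and then run import-all.sh with bash.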

Thanks again,

O. O.




Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-09 Thread O. O.
Thanks Joshua. I intend to try two approaches. The first is to use 
xml2sql and then fill the rest of the tables with the individual SQL 
dumps of the tables that are already provided. The second is to use 
MWDumper and then import the rest of the tables using the SQL dumps 
already provided, to see if there are any differences.

Joshua C. Lerner wrote:
 Thanks for making this attempt. Let me know if your rebuildall.php has 
 memory issues.
 
 Seems fine - steady at 2.2% of memory available.
 
 This is really getting confusing for me – because there are so many ways – 
 all of which guaranteed to work – that work, and the one that is recommended 
 – does not seem to work.
 
 I think you mean all of which are *not* guaranteed to work.
 
 I would try out your approach too – but it would take time as I only have 
 one computer to spare.
 
 If you want I can just send you a database dump. Either now, or after
 rebuildall.php finishes. Right now it's refreshing the links
 table, but only up to page_id 34,100 out of over 2 million pages.
 It'll be running for days.
 
 Joshua

Thanks for posting your experience with rebuildall.php. I think I might 
be able to live with the bad output I get – if I cannot manage to get 
this to work.
Thanks again,
O. O.





Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-08 Thread O. O.
Daniel Kinzler wrote:
 
 That sounds very *very* odd. because page content is imported as-is in both
 cases, it's not processed in any way. The only thing I can imagine is that
 things don't look right if you don't have all the templates imported yet.

Thanks Daniel. Yes, I think that this may be because the templates are 
not imported (I see a lot of unexpanded Template: ... references). Any 
suggestions on how to import the templates?

I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the 
templates – but I did not find a way to install them separately.


Another thing I noticed (with the Portuguese wiki, which is a much 
smaller dump than the English wiki) is that the number of pages 
imported by importDump.php and MWDumper differs, i.e. importDump.php 
imported many more pages than MWDumper. That is why I would have 
preferred to do this using importDump.php.


Also, in a previous post you mentioned taking care of the “secondary 
link tables”. How do I do that? Does “secondary links” refer to 
language links, external links, template links, image links, category 
links, page links, or something else?

Thanks for your patience

O.O.




Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-08 Thread Felipe Ortega



--- On Sun, 3/8/09, O. O. olson...@yahoo.com wrote:

 I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the 
 templates – but I did not find a way to install them separately.
 

No, it only contains a dump of the current version of each article (involving 
the page, revision and text tables in the DB).

 
 Another thing I noticed (with the Portuguese wiki, which is a much 
 smaller dump than the English wiki) is that the number of pages 
 imported by importDump.php and MWDumper differs, i.e. importDump.php 
 imported many more pages than MWDumper. That is why I would have 
 preferred to do this using importDump.php.
 

On download.wikimedia.org/your_lang_here you can check how many pages were 
supposed to be included in each dump.

You also have other parsers you may want to check (in my experience, my parser 
was slightly faster than mwdumper):
http://meta.wikimedia.org/wiki/WikiXRay_Python_parser

 
 Also, in a previous post you mentioned taking care of the “secondary 
 link tables”. How do I do that? Does “secondary links” refer to 
 language links, external links, template links, image links, category 
 links, page links, or something else?
 

On the same page for downloads you have a list of additional dumps in SQL 
format (then compressed with gzip). I guess you may also want to import them 
(but of course, you don't need a parser for them, they can be directly loaded 
in the DB).

Best,

F.

 Thanks for your patience
 
 O.O.
 
 

Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-08 Thread Daniel Kinzler
O. O. wrote:
 Daniel Kinzler wrote:
 That sounds very *very* odd. because page content is imported as-is in both
 cases, it's not processed in any way. The only thing I can imagine is that
 things don't look right if you don't have all the templates imported yet.
 
 Thanks Daniel. Yes, I think that this may be because the templates are 
 not imported (I see a lot of unexpanded Template: ... references). Any 
 suggestions on how to import the templates?
 
   I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains 
 the templates – but I did not find a way to install them separately.

They should be contained. As it says on the download page: “Articles, 
templates, image descriptions, and primary meta-pages.”

 Another thing I noticed (with the Portuguese wiki, which is a much 
 smaller dump than the English wiki) is that the number of pages 
 imported by importDump.php and MWDumper differs, i.e. importDump.php 
 imported many more pages than MWDumper. That is why I would have 
 preferred to do this using importDump.php.

The number of pages should be the same. Sounds to me like the import 
with mwdumper was simply incomplete. Any error messages?


 Also, in a previous post you mentioned taking care of the “secondary 
 link tables”. How do I do that? Does “secondary links” refer to 
 language links, external links, template links, image links, category 
 links, page links, or something else?

This is exactly it. You can rebuild them using the rebuildall.php 
maintenance script (or was it refreshAll? something like that). But 
that takes *very* long to run, and might result in the same memory 
problem you experienced before.

The alternative is to download dumps of these tables and import them 
into MySQL directly. They are available from the download site.

-- daniel



Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-08 Thread O. O.
Felipe Ortega wrote:
 
 
 --- On Sun, 3/8/09, O. O. olson...@yahoo.com wrote:
 
 I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains the 
 templates – but I did not find a way to install them separately.

 
 No, it only contains a dump of the current version of each article (involving 
 the page, revision and text tables in the DB).

Thanks Felipe for posting.

The pages-articles.xml.bz2 file listed at 
http://download.wikimedia.org/enwiki/20081008/ is described as 
“Articles, templates, image descriptions, and primary meta-pages.” What 
does “templates” mean if it does not contain the templates?

 
 Another thing I noticed (with the Portuguese wiki, which is a much 
 smaller dump than the English wiki) is that the number of pages 
 imported by importDump.php and MWDumper differs, i.e. importDump.php 
 imported many more pages than MWDumper. That is why I would have 
 preferred to do this using importDump.php.

 
 On download.wikimedia.org/your_lang_here you can check how many pages were 
 supposed to be included in each dump.
 
 You also have other parsers you may want to check (in my experience, my 
 parser was slightly faster than mwdumper):
 http://meta.wikimedia.org/wiki/WikiXRay_Python_parser

Here my concern is not about speed – but about integrity. I don’t mind 
the import taking a long time – as long as it completes. I used 
importDump.php because it was listed as the “recommended way” of 
importing. But now I realize that no one seems to have used it on a 
full Wikipedia dump.

Nonetheless, I will give your tool a try sometime over the next two 
weeks or so.


 
 Also, in a previous post you mentioned taking care of the “secondary 
 link tables”. How do I do that? Does “secondary links” refer to 
 language links, external links, template links, image links, category 
 links, page links, or something else?

 
 On the same page for downloads you have a list of additional dumps in SQL 
 format (then compressed with gzip). I guess you may also want to import them 
 (but of course, you don't need a parser for them, they can be directly loaded 
 in the DB).
 
 Best,
 
 F.
 

I have not tried these yet. I will try them tomorrow and get back to 
you, i.e. the list.

Thanks again,
O. O.



Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-08 Thread O. O.
Daniel Kinzler wrote:
 O. O. wrote:

  I thought that pages-articles.xml.bz2 (i.e. the XML dump) contains 
 the templates – but I did not find a way to install them separately.
 
 They should be contained. As it says on the download page: “Articles, 
 templates, image descriptions, and primary meta-pages.”

Thanks Daniel. I know that the templates are contained in 
pages-articles.xml.bz2. However, since you said that MWDumper may not 
have imported the templates, my question is how to import them.
 
 Another thing I noticed (with the Portuguese wiki, which is a much 
 smaller dump than the English wiki) is that the number of pages 
 imported by importDump.php and MWDumper differs, i.e. importDump.php 
 imported many more pages than MWDumper. That is why I would have 
 preferred to do this using importDump.php.
 
 The number of pages should be the same. Sounds to me like the import 
 with mwdumper was simply incomplete. Any error messages?
 

Actually, I was intending to start a separate thread on this topic – 
because both MWDumper and importDump.php report that they are skipping 
certain pages. I did not note down the errors that I received from 
MWDumper – but the errors from importDump.php look like the one below.

Skipping interwiki page title 'Page_Title'

Anyway, both have the word “Skipping …” as part of their error message. 
I do not have the actual figures – but I noticed that importDump.php 
seemed to import more pages than MWDumper. (I unfortunately did not 
save the output – so I cannot compare how many times I got these 
errors.)
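Next time I will simply count the rows in the page table after each run 
and compare, e.g.:

$ mysql -u root -p wikidb -e 'SELECT COUNT(*) FROM page;'

(wikidb being the database name from my setup.)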


 Also, in a previous post you mentioned taking care of the “secondary 
 link tables”. How do I do that? Does “secondary links” refer to 
 language links, external links, template links, image links, category 
 links, page links, or something else?
 
 This is exactly it. You can rebuild them using the rebuildall.php 
 maintenance script (or was it refreshAll? something like that). But 
 that takes *very* long to run, and might result in the same memory 
 problem you experienced before.

Yes, the script is called rebuildall.php and is mentioned at 
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper 
– as you said, I am expecting memory problems with this too, since 
importDump.php already has memory issues.
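If I do run it, I will raise the PHP CLI memory limit explicitly, 
roughly:

$ php -d memory_limit=2048M maintenance/rebuildall.php

(The 2048M value is just a guess for my machine.)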

 
 The alternative is to download dumps of these tables and import them 
 into MySQL directly. They are available from the download site.
 
 -- daniel

I will try to import the tables tomorrow and see what I get.


Thanks again,

O. O.




Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-08 Thread O. Olson

Thanks Joshua. I would prefer that you post to the Mailing List / Newsgroup – 
so that all can benefit from your ideas. 

--- On Sun, 8 Mar 2009, Joshua C. Lerner jler...@gmail.com wrote:

 From: Joshua C. Lerner jler...@gmail.com
 Subject: Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
 
 Just for kicks I decided to try to do an import of ptwiki - using what
 I learned in bringing up mirrors of 4 Greek and 3 English Wikimedia
 sites, including Greek Wikipedia. Basically I had the best luck with
 Xml2sql (http://meta.wikimedia.org/wiki/Xml2sql). The conversion from
 XML to SQL went smoothly:

 # ./xml2sql /mnt/pt/ptwiki-20090128-pages-articles.xml

 As did the import:

 # mysqlimport -u root -p --local pt ./{page,revision,text}.txt
 Enter password:
 pt.page: Records: 1044220  Deleted: 0  Skipped: 0  Warnings: 0
 pt.revision: Records: 1044220  Deleted: 0  Skipped: 0  Warnings: 3
 pt.text: Records: 1044220  Deleted: 0  Skipped: 0  Warnings: 0

 I'm running maintenance/rebuildall.php at the moment:

 # php rebuildall.php
 ** Rebuilding fulltext search index (if you abort this will break
 searching; run this script again to fix):
 Dropping index...
 Rebuilding index fields for 2119470 pages...
 442500

 (still running)

 I'll send a note to the list with the results of this experiment. Let
 me know if you need additional information or help. Are you trying to
 set up any mirrors?

 Joshua


Thanks for making this attempt. Let me know if your rebuildall.php has memory 
issues. 

This is really getting confusing for me – because there are so many ways – all 
of which guaranteed to work – that work, and the one that is recommended – does 
not seem to work. 

I would try out your approach too – but it would take time as I only have one 
computer to spare. 

Thanks,
O.o. 





Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-07 Thread Daniel Kinzler
Platonides wrote:
 O. Olson wrote:
 Does anyone have experience importing the Wikipedia XML dumps into
 MediaWiki? I made an attempt with the English wiki dump as well as the
 Portuguese wiki dump, giving PHP (CLI) 1024 MB of memory in the php.ini
 file. Both of these attempts fail with out-of-memory errors.

 Don't use importDump.php for a whole wiki dump, use MWDumper 
 http://www.mediawiki.org/wiki/MWDumper

MWDumper doesn't fill the secondary link tables. Please see
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for detailed
instructions and considerations.

Also keep in mind that the English Wikipedia is *huge*. You will need a 
decent database server to be able to process it. I wouldn't even try on 
a desktop/laptop.

-- daniel



Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-07 Thread O. O.
Platonides wrote:
 
 Don't use importDump.php for a whole wiki dump, use MWDumper
 http://www.mediawiki.org/wiki/MWDumper
 

Thanks Platonides. I am just curious why 
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_importDump.php 
says that importDump.php is the recommended method for imports.

I understand that this page does warn that the import of large dumps 
will be slow. My concern here is not the “slowness” but the fact that 
the import crashes with an out-of-memory error. I can give PHP more 
memory – but the usage just seems to grow over time.

Is this the correct place to ask such questions? Or are there better 
places?

O. O.





Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-07 Thread O. O.
Daniel Kinzler wrote:
 Platonides wrote:
   MWDumper doesn't fill the secondary link tables. Please see
 http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for detailed
 instructions and considerations.
 
 Also keep in mind that the english wikipedia is *huge*. You will need a decent
 database server to be able to process it. I wouldn't even try on a 
 desktop/laptop.
 
 -- daniel


Thanks Daniel. I have tried MWDumper, and the results differ from 
importDump.php, i.e. the formatting is messed up. While tracking down 
what I might be doing wrong, I would prefer to use the native method.

Secondly, my question here is about PHP – not about the database. I 
don’t see how a memory leak in PHP can be caused by the database.

Has anyone had practical experience with importDump.php? Did you face 
any memory issues?

Thanks
  O. O.




Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-07 Thread Jason Schulz
Is this on MW older than 1.14? You may want to disable profiling if it is 
on.

-Aaron

--
From: O. O.  olson...@yahoo.com
Sent: Saturday, March 07, 2009 10:28 PM
To: wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

 Platonides wrote:

 Don't use importDump.php for a whole wiki dump, use MWDumper
 http://www.mediawiki.org/wiki/MWDumper


  Thanks Platonides. I am just curious why
  http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_importDump.php
  says that importDump.php is the recommended method for imports.

  I understand that this page does warn that the import of large dumps
  will be slow. My concern here is not the “slowness” but the fact that
  the import crashes with an out-of-memory error. I can give PHP more
  memory – but the usage just seems to grow over time.

  Is this the correct place to ask such questions? Or are there better
  places?

 O. O.





Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-07 Thread O. O.
Jason Schulz wrote:
 Is this on MW older than 1.14? You may want to disable profiling if it is 
 on.
 
 -Aaron
 
Thanks Jason/Aaron. No, this is the recent MW 1.14 – downloaded at the 
beginning of this week from http://www.mediawiki.org/wiki/Download.



[Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-06 Thread O. Olson

Hi,

I am not sure if this is the correct place to ask this – if not then please 
let me know which is the best place for such a question.

Does anyone have experience importing the Wikipedia XML dumps into 
MediaWiki? I made an attempt with the English wiki dump as well as the 
Portuguese wiki dump, giving PHP (CLI) 1024 MB of memory in the php.ini 
file. Both of these attempts fail with out-of-memory errors.

I am using the latest version of MediaWiki, 1.14.0, and PHP 
5.2.6-1+lenny2 with Suhosin-Patch 0.9.6.2 (cli) (built: Jan 26 2009 
22:41:04).

Does anyone have experience with this import and how to avoid the memory 
errors? I can give it more memory – but it seems to be leaking memory over time.

Thanks again,
O. O.



