Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Mohamed Magdy wrote:

I don't remember if I already mentioned this: you can split the XML file * into smaller pieces and then import them using importDump.php. Use a loop to generate a file like this and then run it:

    #!/bin/bash
    php maintenance/importDump.php /path/pagexml.1
    wait
    php maintenance/importDump.php /path/pagexml.2
    ...

I haven't tried starting many importDump.php processes on different XML files simultaneously. Will it work?

* = http://blog.prashanthellina.com/2007/10/17/ways-to-process-and-use-wikipedia-dumps/

Thanks Mohamed. This is a good suggestion, but I am a bit wary of trying it, because if I have problems later, I would not be sure whether they were caused by using this script to split the XML files. I understand that the script looks OK, in that it simply splits the XML files at the “/page” boundaries, but I don’t know much about how this would affect the final result.

Thanks again,
O. O.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
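The sequential import loop suggested above could be sketched roughly as follows. This is only a sketch: the function name `import_pieces` and the `IMPORT_CMD` override are hypothetical, and it assumes the pieces are named `<prefix>.1`, `<prefix>.2`, and so on, as in the example script. It says nothing about whether splitting at page boundaries is safe for a given dump.

```shell
#!/bin/bash
# Sketch: import split dump pieces one after another.
# IMPORT_CMD is a hypothetical override so the loop can be dry-run
# (for example IMPORT_CMD=echo) before touching a real wiki.
IMPORT_CMD=${IMPORT_CMD:-"php maintenance/importDump.php"}

import_pieces() {
    # $1: path prefix of the pieces, $2: number of pieces
    local prefix="$1" count="$2" i=1 piece
    while [ "$i" -le "$count" ]; do
        piece="${prefix}.${i}"
        if [ ! -f "$piece" ]; then
            echo "missing piece: $piece" >&2
            return 1
        fi
        # Each import runs to completion before the next one starts;
        # stop at the first failure so it is clear which piece broke.
        $IMPORT_CMD "$piece" || return 1
        i=$((i + 1))
    done
}

# Usage (hypothetical path and count): import_pieces /path/pagexml 20
```

Stopping at the first failing piece makes it easier to tell whether a later problem comes from one particular chunk rather than from the split as a whole.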
[Wikitech-l] HTML not Rendered correctly after Import of Wikipedia
Hi,

I attempted to import the English Wikipedia into MediaWiki by first downloading pages-articles.xml.bz2, uncompressing it, splitting it using

    xml2sql enwiki-20081008-pages-articles.xml

and finally importing the results using

    mysqlimport -u root -p --local wikidb ./{page,revision,text}.txt

I also imported all of the SQL files on http://download.wikimedia.org/enwiki/20081008/

The problem that I am now facing is that the rendered HTML is wrong in places. Mostly this happens at the beginning of the text on the page. For example, at the beginning of the United_Kingdom article I get:

    </tr><tr> <th colspan=2>Calling code</th> <td>+44 </td> </tr></table>

After this I get the normal article text, i.e. “The United Kingdom of …” etc. As a result, the rest of the article is not formatted correctly. For example, in IE the first paragraph is shifted into a column on the right. In both IE and Mozilla I do not get “navigation”, “search”, “interaction”, “toolbox”, “languages” and the sunflower MediaWiki picture in the top-left corner. (I get these elements on other pages, though. I just wanted to illustrate the problems that bad HTML causes.)

Another problem is that at the top of each page I get “Error: image is invalid or non-existent”. Is there a way to disable this error message? I know that I don’t have the images, and that is not a problem for me. I would only prefer not to have this error message in red at the top of the article.

Any ideas on what I might be doing wrong here?

Thanks,
O. O.
[Wikitech-l] Understanding the meaning of “List of page titles”
Hi,

I am looking at the dump of the English Wikipedia at http://download.wikimedia.org/enwiki/20081008/

There is a file called “all-titles-in-ns0.gz” which is supposed to contain the List of Page Titles. If I do

    cat enwiki-20081008-all-titles-in-ns0 | wc -l

I get 5716820. On the same page, a little above, under “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.

So why are these two numbers different? Are there pages without a title?

Thanks a lot,
O. O.
Re: [Wikitech-l] Understanding the meaning of “List of page titles”
On Fri, Mar 13, 2009 at 2:44 PM, O. O. olson...@yahoo.com wrote:

Hi, I am looking at the dump of the English Wikipedia at http://download.wikimedia.org/enwiki/20081008/ There is a file called “all-titles-in-ns0.gz” which is supposed to contain the List of Page Titles. If I do cat enwiki-20081008-all-titles-in-ns0 | wc -l I get 5716820. On the same page, a little above in “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.

The description for pages-articles.xml.bz2 says it contains “Articles, templates, image descriptions, and primary meta-pages.” all-titles-in-ns0.gz contains (as the name suggests) only the titles in ns0, i.e., the main namespace, which holds the articles. It does not contain templates, image descriptions, or primary meta-pages (whatever those are).
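To see how the pages in a dump divide between the main namespace and the others, one could tally the `<title>` elements by prefix. The following is a rough sketch, not a definitive tool: the function name `count_by_namespace` is hypothetical, the prefix list in `NS_PREFIXES` is an illustrative and incomplete assumption, and it assumes each `<title>` element sits on a single line of an uncompressed XML dump.

```shell
#!/bin/bash
# Sketch: tally <title> elements in an XML dump by namespace prefix.
# NS_PREFIXES is an assumed, incomplete list of namespace names; titles
# without one of these prefixes are counted as main namespace (ns0).
NS_PREFIXES='Talk|User|Wikipedia|Image|File|MediaWiki|Template|Help|Category|Portal'

count_by_namespace() {
    # $1: path to an uncompressed XML dump file
    awk -v ns="^(${NS_PREFIXES}):" '
        match($0, /<title>[^<]*<\/title>/) {
            # Strip the 7-character <title> and 8-character </title> tags.
            title = substr($0, RSTART + 7, RLENGTH - 15)
            if (match(title, ns)) {
                key = substr(title, 1, RLENGTH - 1)   # prefix without the colon
            } else {
                key = "(main)"
            }
            counts[key]++
        }
        END { for (k in counts) print counts[k], k }
    ' "$1" | sort -k1,1nr -k2,2
}

# Usage: count_by_namespace enwiki-20081008-pages-articles.xml
```

An explicit prefix list is used instead of "anything before a colon" because main-namespace titles can themselves contain colons.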
Re: [Wikitech-l] how to delete Talk:Project_talk:Community Portal?
Aryeh Gregor wrote:
On Thu, Mar 12, 2009 at 3:58 PM, jida...@jidanni.org wrote:

And how did you create it when it's illegal?

Usually this happens when namespace names change, so that formerly it didn't start with a namespace prefix but now it does. Since both namespace names in this case are canonical and will always exist, I'm not sure how it came to be.

Pages named Talk:NS:Foo, where NS is a valid namespace name, used to be allowed until fairly recently, but they never worked particularly consistently. See https://bugzilla.wikimedia.org/show_bug.cgi?id=5280 for details.

--
Ilmari Karonen
Re: [Wikitech-l] Understanding the meaning of “List of page titles”
Aryeh Gregor wrote:
On Fri, Mar 13, 2009 at 2:44 PM, O. O. olson...@yahoo.com wrote:

Hi, I am looking at the dump of the English Wikipedia at http://download.wikimedia.org/enwiki/20081008/ There is a file called “all-titles-in-ns0.gz” which is supposed to contain the List of Page Titles. If I do cat enwiki-20081008-all-titles-in-ns0 | wc -l I get 5716820. On the same page, a little above in “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.

The description for pages-articles.xml.bz2 says it contains Articles, templates, image descriptions, and primary meta-pages. all-titles-in-ns0.gz contains (as the name suggests) only the titles in ns0, i.e., the main namespace, articles. It does not contain templates, image descriptions, or primary meta-pages (whatever those are).

Thanks Ilmari and Aryeh. I am not sure what “primary meta-pages” are; however, “templates” and “image descriptions” do have titles. You can check this in the online version of the English Wikipedia.

O. O.
Re: [Wikitech-l] Understanding the meaning of “List of page titles”
O. O. wrote:
Aryeh Gregor wrote:
On Fri, Mar 13, 2009 at 2:44 PM, O. O. olson...@yahoo.com wrote:

Hi, I am looking at the dump of the English Wikipedia at http://download.wikimedia.org/enwiki/20081008/ There is a file called “all-titles-in-ns0.gz” which is supposed to contain the List of Page Titles. If I do cat enwiki-20081008-all-titles-in-ns0 | wc -l I get 5716820. On the same page, a little above in “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.

The description for pages-articles.xml.bz2 says it contains Articles, templates, image descriptions, and primary meta-pages. all-titles-in-ns0.gz contains (as the name suggests) only the titles in ns0, i.e., the main namespace, articles. It does not contain templates, image descriptions, or primary meta-pages (whatever those are).

Thanks Ilmari and Aryeh. I am not sure what are “primary meta-pages” – however “templates”, and “image descriptions” do have Titles. You can check this in the online version of the English Wikipedia.

Sure they have titles. But they are not in ns0 and thus are not contained in this list, which is ns0 only (that is, the main article namespace).

-- daniel
Re: [Wikitech-l] research-oriented toolserver?
Hi all,

Judging by the replies, we think we failed to communicate some of our ideas clearly, and we'd like to take the opportunity to clear that up.

We did not want to narrow this down to a third-party toolserver only. Before we initiated contact, we noticed the need for adding more resources to the existing cluster. We therefore also had in mind the idea of augmenting the toolserver rather than attempting to create a competitor to it. For instance, this could allow the toolserver to also host applications that require some amount of text crunching, which is currently not feasible as far as we can tell.

Additionally, we think there could perhaps be two paths to account creation, one for Wikipedians and one for researchers. The research path would be laid out with clearer documentation on the requirements projects need to meet to fit the toolserver and on what the application should contain. Combined with faster feedback, this would help make the process easier for researchers.

We hope that this clears up some central points in our ideas about a research-oriented toolserver. Currently we are exploring several ideas, and this particular one might not become more than a thought and a thread on a mailing list. Nonetheless, perhaps there are thoughts here that can become more solid somewhere down the line.

Morten Warncke-Wang, Research Assistant
John Riedl, Professor
GroupLens Research
www.grouplens.org
Re: [Wikitech-l] Understanding the meaning of “List of page titles”
Daniel Kinzler wrote:
O. O. wrote:
Aryeh Gregor wrote:
On Fri, Mar 13, 2009 at 2:44 PM, O. O. olson...@yahoo.com wrote:

Hi, I am looking at the dump of the English Wikipedia at http://download.wikimedia.org/enwiki/20081008/ There is a file called “all-titles-in-ns0.gz” which is supposed to contain the List of Page Titles. If I do cat enwiki-20081008-all-titles-in-ns0 | wc -l I get 5716820. On the same page, a little above in “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.

The description for pages-articles.xml.bz2 says it contains Articles, templates, image descriptions, and primary meta-pages. all-titles-in-ns0.gz contains (as the name suggests) only the titles in ns0, i.e., the main namespace, articles. It does not contain templates, image descriptions, or primary meta-pages (whatever those are).

Thanks Ilmari and Aryeh. I am not sure what are “primary meta-pages” – however “templates”, and “image descriptions” do have Titles. You can check this in the online version of the English Wikipedia.

Sure they have titles. But they are not ns0 and thus not contained in this list. Which is ns0 only (that is, main article namespace).

-- daniel

Thanks Daniel. I had not understood the meaning of NS0. Anyway, I found the details of NS0 at http://en.wikipedia.org/wiki/Wikipedia:NS0

However, this confuses me even more. The above link says that “only articles” and no redirects are in the namespace NS0. Also, Talk: pages are not included in NS0. So, when the current English Wikipedia advertises 2,791,033 articles, I cannot understand why the list of titles contains 5716820 titles. That is a little more than double.

Thanks for helping out,
O. O.
Re: [Wikitech-l] Understanding the meaning of “List of page titles”
On Sat, Mar 14, 2009 at 9:26 AM, O. O. olson...@yahoo.com wrote:

The above link says that “only articles” and no redirects are in the namespace NS0. Also Talk: pages are not included in the NS0. Then, when the current English Wikipedia advertises 2,791,033 Articles, I cannot understand why the list of Titles contains 5716820 Titles? This is a little more than double?

The larger number includes redirects, the smaller number doesn't.

--
Andrew Garrett
Re: [Wikitech-l] Understanding the meaning of “List of page titles”
Andrew Garrett wrote:
On Sat, Mar 14, 2009 at 9:26 AM, O. O. olson...@yahoo.com wrote:

The above link says that “only articles” and no redirects are in the namespace NS0. Also Talk: pages are not included in the NS0. Then, when the current English Wikipedia advertises 2,791,033 Articles, I cannot understand why the list of Titles contains 5716820 Titles? This is a little more than double?

The larger number includes redirects, the smaller number doesn't.

Then why does http://en.wikipedia.org/wiki/Wikipedia:NS0 say that “Redirects” are not considered articles and hence are not in NS0?

O.O.
Re: [Wikitech-l] Understanding the meaning of “List of page titles”
On Sat, Mar 14, 2009 at 9:34 AM, O. O. olson...@yahoo.com wrote:
Andrew Garrett wrote:
On Sat, Mar 14, 2009 at 9:26 AM, O. O. olson...@yahoo.com wrote:

The above link says that “only articles” and no redirects are in the namespace NS0. Also Talk: pages are not included in the NS0. Then, when the current English Wikipedia advertises 2,791,033 Articles, I cannot understand why the list of Titles contains 5716820 Titles? This is a little more than double?

The larger number includes redirects, the smaller number doesn't.

Then why does this http://en.wikipedia.org/wiki/Wikipedia:NS0 say that “Redirects” are not considered as Articles and hence are not in NS0?

It doesn't say that. It says “Not all pages in the article namespace are considered to be articles”, listing redirects as an example.

--
Andrew Garrett
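One way to check how much of a dump consists of redirects is to count the pages whose text begins with `#REDIRECT`. The sketch below makes several assumptions: the function name `count_redirects` is hypothetical, the dump is an uncompressed XML file in which each `<text ...>` opening tag is followed on the same line by the start of the page text, and the count covers every namespace in the file, not only ns0.

```shell
#!/bin/bash
# Sketch: count pages and redirect pages in an XML dump.
# Assumes the page text starts on the same line as its <text ...> tag.
count_redirects() {
    # $1: path to an uncompressed XML dump; prints "<pages> <redirects>"
    awk '
        /<page>/ { pages++ }
        # The #REDIRECT keyword is matched case-insensitively.
        tolower($0) ~ /<text[^>]*>#redirect/ { redirects++ }
        END { printf "%d %d\n", pages, redirects }
    ' "$1"
}

# Usage: count_redirects enwiki-20081008-pages-articles.xml
```

Subtracting the second number from the first gives a rough count of non-redirect pages, which can then be compared with the advertised article count.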
[Wikitech-l] not all tables need to be backed up
Gentlemen, it occurred to me that, on close examination, some of the tables dumped when backing up one's wiki's database are temporary to various degrees. Though they need to be present in a proper dump, they could perhaps be emptied of their values, saving much space in the resulting SQL.bz2 (or similar) file.

Looking at the mysqldump man page, one finds no perfect options to do so, so one instead makes one's own script:

    $ mysqldump my_database | perl -nwle 'BEGIN{$dontdump=qr/wiki_(objectcache|searchindex)/} s/(^-- )(Dumping data for table `$dontdump`$)/$1NOT $2/; next if /^LOCK TABLES `$dontdump` WRITE;$/../^UNLOCK TABLES;$/; print;'

Though not myself daring to make any recommendations on http://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki#Tables I am still curious which tables can always be emptied, which can be emptied if one is willing to remember to run a maintenance script to resurrect their contents, etc.
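As an alternative to filtering the dump text, mysqldump's own `--no-data`, `--no-create-info` and `--ignore-table` options may cover this case: one pass writes the structure of every table, and a second pass writes row data for all tables except the ones to skip. The sketch below is an assumption-laden illustration rather than a recommendation on which tables are safe to empty. The function name `backup_wiki` and the `DUMP_CMD` override are hypothetical, the table names are the two from the script above, and connection options (user, password) are assumed to come from a MySQL option file.

```shell
#!/bin/bash
# Sketch: dump all table definitions, but skip row data for chosen tables.
# DUMP_CMD is a hypothetical override for dry runs (for example DUMP_CMD=echo).
DUMP_CMD=${DUMP_CMD:-mysqldump}

backup_wiki() {
    # $1: database name, $2: table prefix, remaining args: tables to leave empty
    local db="$1" prefix="$2" ignore="" table
    shift 2
    for table in "$@"; do
        # --ignore-table expects the db_name.tbl_name form.
        ignore="$ignore --ignore-table=${db}.${prefix}${table}"
    done
    # Pass 1: CREATE TABLE statements for every table, no rows.
    $DUMP_CMD --no-data "$db" || return 1
    # Pass 2: rows only; $ignore is unquoted on purpose so it splits into options.
    $DUMP_CMD --no-create-info $ignore "$db"
}

# Usage: backup_wiki my_database wiki_ objectcache searchindex | bzip2 > backup.sql.bz2
```

Restoring such a dump recreates the skipped tables empty, so any table whose contents must be rebuilt afterwards still needs its maintenance script run.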