Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-13 Thread O. O.
Mohamed Magdy wrote:
> I don't remember if I already mentioned this: you can split the xml
> file * into smaller pieces then import it using importDump.php.
>
> Use a loop to make a file like this and then run it:
> #!/bin/bash
> php maintenance/importDump.php  /path/pagexml.1
> wait
> php maintenance/importDump.php  /path/pagexml.2
> ...
>
> I haven't tried to start many php importDump.php processes working on
> different xml files simultaneously, will it work?
>
> * =
> http://blog.prashanthellina.com/2007/10/17/ways-to-process-and-use-wikipedia-dumps/

Thanks Mohamed. This is a good suggestion, but I am a bit wary of trying
it, because if I later have problems, I would not be sure whether they
were caused by using this script to split the XML files.

I understand that the script looks OK, in that it simply splits the XML
files at the “</page>” boundaries, but I don’t know much about how this
would affect the final result.
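
As a rough illustration of what splitting at the page boundaries involves, here is a minimal Python sketch. It is not the script from the blog post linked above. It assumes that each <page>, </page> and </mediawiki> tag sits on a line of its own, and the name split_dump and the sample data are made up for the example. The point it shows is that every piece repeats the header and closes the root element, so each piece is a well-formed XML document by itself.

```python
import io
import xml.etree.ElementTree as ET


def split_dump(lines, pages_per_chunk):
    """Yield well-formed XML documents holding up to pages_per_chunk pages each."""
    header = []        # lines before the first <page>: <mediawiki ...> and <siteinfo>
    chunk = []         # page lines collected for the current piece
    pages = 0
    in_header = True
    for line in lines:
        stripped = line.strip()
        if in_header:
            if stripped != "<page>":
                header.append(line)
                continue
            in_header = False
        if stripped == "</mediawiki>":
            break
        chunk.append(line)
        if stripped == "</page>":
            pages += 1
            if pages == pages_per_chunk:
                # Repeat the header and close the root element in every piece.
                yield "".join(header + chunk) + "</mediawiki>\n"
                chunk, pages = [], 0
    if chunk:
        yield "".join(header + chunk) + "</mediawiki>\n"


# Tiny made-up dump, only to exercise the function.
SAMPLE = """<mediawiki>
  <siteinfo><sitename>Example</sitename></siteinfo>
  <page>
    <title>A</title>
  </page>
  <page>
    <title>B</title>
  </page>
  <page>
    <title>C</title>
  </page>
</mediawiki>
"""

pieces = list(split_dump(io.StringIO(SAMPLE), pages_per_chunk=2))
for piece in pieces:
    root = ET.fromstring(piece)  # raises ParseError if a piece is malformed
    print(len(root.findall("page")), "page(s)")
```

Because no page is ever cut in half, the set of pages across all pieces is the same as in the original file; only the wrapper lines are duplicated.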

Thanks again,

O. O.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] HTML not Rendered correctly after Import of Wikipedia

2009-03-13 Thread O. O.
Hi,
I attempted to import the English Wikipedia into MediaWiki by first 
downloading the pages-articles.xml.bz2, uncompressing it, splitting it 
using

xml2sql enwiki-20081008-pages-articles.xml

and finally imported the results using

mysqlimport -u root -p --local wikidb ./{page,revision,text}.txt

I also imported all of the SQL files on 
http://download.wikimedia.org/enwiki/20081008/


The problem that I am now facing is that the rendered HTML is wrong in
places. Mostly this happens at the beginning of the text on the page.
For example, at the beginning of the United_Kingdom article I get:

</tr><tr>  <th colspan="2">Calling code</th>  <td>+44  </td>
</tr></table>

After this I get the normal article text i.e. “The United Kingdom of … “ 
etc.

The result of this is that the rest of the article is not formatted
correctly. For example, in IE the first paragraph is shifted into a
column on the right. In both IE and Mozilla I do not get the
“navigation”, “search”, “interaction”, “toolbox”, “languages” and the
Sunflower MediaWiki picture in the top-left corner.  (I get these
elements in other pages though. I just wanted to illustrate the problems
that bad HTML causes.)

Another problem I am having is that at the top of each page I get “Error:
image is invalid or non-existent”. Is there a way to disable this error
message? I know that I don’t have the images, and that is not a problem
for me. I would only prefer not to have this error message in red at the
top of the article.

Any ideas on what I might be doing wrong here?

Thanks,
O. O.




[Wikitech-l] Understanding the meaning of “List of page titles”

2009-03-13 Thread O. O.
Hi,
I am looking at the dump of the English Wikipedia at 
http://download.wikimedia.org/enwiki/20081008/ There is a file called 
“all-titles-in-ns0.gz” which is supposed to contain the List of Page 
Titles.  If I do

cat enwiki-20081008-all-titles-in-ns0 | wc -l

I get 5716820. On the same page, a little above in 
“pages-articles.xml.bz2” we have “enwiki 7649051 pages”.

So why are these two numbers different? Are there pages without a Title?
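
The line count above can also be taken straight from the compressed file. The following is a small Python sketch of that, not something from the dump tooling; the function name count_lines and the three-title sample file are made up for the example.

```python
import gzip
import os
import tempfile


def count_lines(path):
    """Count the lines of a gzip-compressed text file without unpacking it to disk."""
    with gzip.open(path, "rb") as handle:
        return sum(1 for _ in handle)


# Made-up three-title file, only to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "all-titles-in-ns0.gz")
    with gzip.open(sample, "wb") as handle:
        handle.write(b"Anarchism\nUK\nUnited_Kingdom\n")
    title_count = count_lines(sample)
print(title_count)
```

Unlike wc -l, this also counts a final line that lacks a trailing newline.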

Thanks a lot,
O. O.




Re: [Wikitech-l] Understanding the meaning of “List of page titles”

2009-03-13 Thread Aryeh Gregor
On Fri, Mar 13, 2009 at 2:44 PM, O. O. olson...@yahoo.com wrote:
> Hi,
> I am looking at the dump of the English Wikipedia at
> http://download.wikimedia.org/enwiki/20081008/ There is a file called
> “all-titles-in-ns0.gz” which is supposed to contain the List of Page
> Titles.  If I do
>
> cat enwiki-20081008-all-titles-in-ns0 | wc -l
>
> I get 5716820. On the same page, a little above in
> “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.

The description for pages-articles.xml.bz2 says it contains "Articles,
templates, image descriptions, and primary meta-pages."
all-titles-in-ns0.gz contains (as the name suggests) only the titles
in ns0, i.e., the main namespace, articles.  It does not contain
templates, image descriptions, or primary meta-pages (whatever those
are).


Re: [Wikitech-l] how to delete Talk:Project_talk:Community Portal?

2009-03-13 Thread Ilmari Karonen
Aryeh Gregor wrote:
> On Thu, Mar 12, 2009 at 3:58 PM,  jida...@jidanni.org wrote:
>> And how did you create it when it's illegal?
>
> Usually this happens when namespace names change, so that formerly it
> didn't start with a namespace prefix but now it does.  Since both
> namespace names in this case are canonical and will always exist, I'm
> not sure how it came to be.

Pages named Talk:NS:Foo, where NS is a valid namespace name, used to 
be allowed until fairly recently, but they never worked particularly 
consistently.  See https://bugzilla.wikimedia.org/show_bug.cgi?id=5280 
for details.
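
To make the Talk:NS:Foo shape concrete, here is a hedged Python sketch that flags such titles. It is not MediaWiki's own title parser. The namespace list is a made-up subset, and the helper names are hypothetical; the only behaviour borrowed from MediaWiki is that namespace names ignore case and treat underscores as spaces.

```python
# Hypothetical subset of namespace names; a real wiki defines more.
NAMESPACES = {"Talk", "User", "User talk", "Project", "Project talk",
              "Template", "Help", "Category"}


def normalize(name):
    """Compare namespace names case-insensitively, with underscores as spaces."""
    return name.replace("_", " ").strip().lower()


def is_talk_of_prefixed_title(title, namespaces=NAMESPACES):
    """Return True for titles of the form Talk:NS:Foo where NS is a namespace name."""
    known = {normalize(name) for name in namespaces}
    parts = title.split(":", 2)
    if len(parts) < 3:
        return False
    return normalize(parts[0]) == "talk" and normalize(parts[1]) in known


print(is_talk_of_prefixed_title("Talk:Project_talk:Community Portal"))
print(is_talk_of_prefixed_title("Talk:Community Portal"))
```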

-- 
Ilmari Karonen



Re: [Wikitech-l] Understanding the meaning of “List of page titles”

2009-03-13 Thread O. O.
Aryeh Gregor wrote:
> On Fri, Mar 13, 2009 at 2:44 PM, O. O. olson...@yahoo.com wrote:
>> Hi,
>> I am looking at the dump of the English Wikipedia at
>> http://download.wikimedia.org/enwiki/20081008/ There is a file called
>> “all-titles-in-ns0.gz” which is supposed to contain the List of Page
>> Titles.  If I do
>>
>> cat enwiki-20081008-all-titles-in-ns0 | wc -l
>>
>> I get 5716820. On the same page, a little above in
>> “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.
>
> The description for pages-articles.xml.bz2 says it contains Articles,
> templates, image descriptions, and primary meta-pages.
> all-titles-in-ns0.gz contains (as the name suggests) only the titles
> in ns0, i.e., the main namespace, articles.  It does not contain
> templates, image descriptions, or primary meta-pages (whatever those
> are).

Thanks Ilmari and Aryeh.

I am not sure what “primary meta-pages” are; however, “templates” and
“image descriptions” do have titles. You can check this in the online
version of the English Wikipedia.

O. O.




Re: [Wikitech-l] Understanding the meaning of “List of page titles”

2009-03-13 Thread Daniel Kinzler
O. O. wrote:
> Aryeh Gregor wrote:
>> On Fri, Mar 13, 2009 at 2:44 PM, O. O. olson...@yahoo.com wrote:
>>> Hi,
>>> I am looking at the dump of the English Wikipedia at
>>> http://download.wikimedia.org/enwiki/20081008/ There is a file called
>>> “all-titles-in-ns0.gz” which is supposed to contain the List of Page
>>> Titles.  If I do
>>>
>>> cat enwiki-20081008-all-titles-in-ns0 | wc -l
>>>
>>> I get 5716820. On the same page, a little above in
>>> “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.
>> The description for pages-articles.xml.bz2 says it contains Articles,
>> templates, image descriptions, and primary meta-pages.
>> all-titles-in-ns0.gz contains (as the name suggests) only the titles
>> in ns0, i.e., the main namespace, articles.  It does not contain
>> templates, image descriptions, or primary meta-pages (whatever those
>> are).
>
> Thanks Ilmari and Aryeh.
>
> I am not sure what are “primary meta-pages” – however “templates”, and
> “image descriptions” do have Titles. You can check this in the online
> version of the English Wikipedia.

Sure, they have titles. But they are not in ns0 and thus are not contained
in this list, which is ns0 only (that is, the main article namespace).

-- daniel


Re: [Wikitech-l] research-oriented toolserver?

2009-03-13 Thread Morten Warncke-Wang
Hi all,

Judging by the replies we think we've failed to communicate clearly
some of the ideas we wanted to put forward, and we'd like to take the
opportunity to try to clear that up.

We did not want to narrow this down to be only about a third party
toolserver.  Before we initiated contact we noticed the need for
adding more resources to the existing cluster.  Therefore we also had
in mind the idea of augmenting the toolserver, rather than attempt to
create a competitor for it.  For instance this could help allow the
toolserver to also host applications requiring some amounts of text
crunching, which is currently not feasible as far as we can tell.

Additionally we think there could perhaps be two paths to account
creation, one for Wikipedians and one for researchers, with the
research path laid out with clearer documentation on the requirements
projects would need to fit the toolserver and what the application
should contain, which combined with faster feedback would aid to make
the process easier for the researchers.

We hope that this clears up some central points in our ideas
surrounding a research oriented toolserver.  Currently we are
exploring several ideas and this particular one might not become more
than a thought and a thread on a mailing list.  Nonetheless perhaps
there are thoughts here that can become more solid somewhere down the
line.

Morten Warncke-Wang, Research Assistant
John Riedl, Professor
GroupLens Research
www.grouplens.org



Re: [Wikitech-l] Understanding the meaning of “List of page titles”

2009-03-13 Thread O. O.
Daniel Kinzler wrote:
> O. O. wrote:
>> Aryeh Gregor wrote:
>>> On Fri, Mar 13, 2009 at 2:44 PM, O. O. olson...@yahoo.com wrote:
>>>> Hi,
>>>> I am looking at the dump of the English Wikipedia at
>>>> http://download.wikimedia.org/enwiki/20081008/ There is a file called
>>>> “all-titles-in-ns0.gz” which is supposed to contain the List of Page
>>>> Titles.  If I do
>>>>
>>>> cat enwiki-20081008-all-titles-in-ns0 | wc -l
>>>>
>>>> I get 5716820. On the same page, a little above in
>>>> “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.
>>> The description for pages-articles.xml.bz2 says it contains Articles,
>>> templates, image descriptions, and primary meta-pages.
>>> all-titles-in-ns0.gz contains (as the name suggests) only the titles
>>> in ns0, i.e., the main namespace, articles.  It does not contain
>>> templates, image descriptions, or primary meta-pages (whatever those
>>> are).
>> Thanks Ilmari and Aryeh.
>>
>> I am not sure what are “primary meta-pages” – however “templates”, and
>> “image descriptions” do have Titles. You can check this in the online
>> version of the English Wikipedia.
>
> Sure they have titles. But they are not ns0 and thus not contained in this
> list. Wich is ns0 only (that is, main article namespace).
>
> -- daniel
>
 
Thanks, Daniel. I had not understood the meaning of NS0. I found
the details of NS0 at http://en.wikipedia.org/wiki/Wikipedia:NS0
However, this confuses me even more.

The above link says that “only articles” and no redirects are in the
namespace NS0. Also, Talk: pages are not included in NS0.
So, when the current English Wikipedia advertises 2,791,033 articles,
I cannot understand why the list of titles contains 5716820 titles. That
is a little more than double.

Thanks for helping out,
O. O.



Re: [Wikitech-l] Understanding the meaning of “List of page titles”

2009-03-13 Thread Andrew Garrett
On Sat, Mar 14, 2009 at 9:26 AM, O. O. olson...@yahoo.com wrote:
> The above link says that “only articles” and no redirects are in the
> namespace NS0. Also Talk: pages are not included in the NS0.
> Then, when the current English Wikipedia advertises 2,791,033 Articles,
> I cannot understand why the list of Titles contains 5716820 Titles? This
> is a little more than double?

The larger number includes redirects, the smaller number doesn't.

-- 
Andrew Garrett


Re: [Wikitech-l] Understanding the meaning of “List of page titles”

2009-03-13 Thread O. O.
Andrew Garrett wrote:
> On Sat, Mar 14, 2009 at 9:26 AM, O. O. olson...@yahoo.com wrote:
>> The above link says that “only articles” and no redirects are in the
>> namespace NS0. Also Talk: pages are not included in the NS0.
>> Then, when the current English Wikipedia advertises 2,791,033 Articles,
>> I cannot understand why the list of Titles contains 5716820 Titles? This
>> is a little more than double?
>
> The larger number includes redirects, the smaller number doesn't.
 
Then why does http://en.wikipedia.org/wiki/Wikipedia:NS0 say that
“Redirects” are not considered Articles and hence are not in NS0?

O.O.



Re: [Wikitech-l] Understanding the meaning of “List of page titles”

2009-03-13 Thread Andrew Garrett
On Sat, Mar 14, 2009 at 9:34 AM, O. O. olson...@yahoo.com wrote:
> Andrew Garrett wrote:
>> On Sat, Mar 14, 2009 at 9:26 AM, O. O. olson...@yahoo.com wrote:
>>> The above link says that “only articles” and no redirects are in the
>>> namespace NS0. Also Talk: pages are not included in the NS0.
>>> Then, when the current English Wikipedia advertises 2,791,033 Articles,
>>> I cannot understand why the list of Titles contains 5716820 Titles? This
>>> is a little more than double?
>>
>> The larger number includes redirects, the smaller number doesn't.
>
> Then why does this http://en.wikipedia.org/wiki/Wikipedia:NS0 say that
> “Redirects” are not considered as Articles and hence are not in NS0?

It doesn't say that; it says "Not all pages in the article namespace
are considered to be articles", listing redirects as an example.
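
The distinction can be checked on a dump directly. Below is a minimal Python sketch, under several assumptions that are not from this thread: a page counts as ns0 when its title does not start with one of a hypothetical list of namespace prefixes, and a page counts as a redirect when its text begins with #REDIRECT. The function name count_ns0 and the sample XML are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical subset of non-main namespace prefixes (an assumption).
OTHER_NAMESPACES = ("Talk:", "User:", "Template:", "Image:",
                    "Wikipedia:", "Category:")


def local(tag):
    """Drop any '{namespace-uri}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_ns0(xml_source):
    """Return (ns0 pages, ns0 redirects) for an uncompressed export XML stream."""
    total = redirects = 0
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = text = ""
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text or ""
            elif name == "text":
                text = node.text or ""
        if not title.startswith(OTHER_NAMESPACES):
            total += 1
            if text.lstrip().upper().startswith("#REDIRECT"):
                redirects += 1
        elem.clear()  # free the finished page to keep memory use flat
    return total, redirects


# Made-up four-page dump, only to exercise the function.
SAMPLE_XML = b"""<mediawiki>
<page><title>United Kingdom</title><revision><text>The United Kingdom of ...</text></revision></page>
<page><title>UK</title><revision><text>#REDIRECT [[United Kingdom]]</text></revision></page>
<page><title>Template:Infobox</title><revision><text>{| |}</text></revision></page>
<page><title>Talk:United Kingdom</title><revision><text>Discussion</text></revision></page>
</mediawiki>"""

ns0_pages, ns0_redirects = count_ns0(io.BytesIO(SAMPLE_XML))
print(ns0_pages, "titles in ns0, of which", ns0_redirects, "redirect(s)")
```

In the sample, both "United Kingdom" and "UK" are ns0 titles, but only the first is an article in the sense discussed above.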

-- 
Andrew Garrett
Sent from: Sydney New South Wales Australia.


[Wikitech-l] not all tables need to be backed up

2009-03-13 Thread jidanni
Gentlemen, it occurred to me that when making a backup of one's wiki's
database, some of the tables dumped are temporary to various degrees.
Although they need to be present in a proper dump, they could perhaps
be emptied of their values, saving much space in the resulting SQL.bz2
(or similar) file.

Looking at the mysqldump man page, one finds no perfect option to do
so, so one instead makes one's own script:

$ mysqldump my_database|
perl -nwle 'BEGIN{$dontdump="wiki_(objectcache|searchindex)"}
s/(^-- )(Dumping data for table `$dontdump`$)/$1NOT $2/;
next if /^LOCK TABLES `$dontdump` WRITE;$/../^UNLOCK TABLES;$/;
print;'

Though not myself daring to make any recommendations on
http://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki#Tables
I am still curious which tables can be emptied always,
which can be emptied if one is willing to remember to run a
maintenance script to resurrect their contents, etc.
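
For readers who find the one-liner above hard to follow, here is the same filtering idea as a Python sketch. It is an illustration only, not a tested backup tool: it assumes the dump marks each table's data with a "LOCK TABLES `name` WRITE;" ... "UNLOCK TABLES;" pair, as the script above does, and the function name strip_table_data and the sample dump text are made up.

```python
import re


def strip_table_data(lines, tables):
    """Yield mysqldump lines, omitting the LOCK ... UNLOCK data section of the named tables."""
    names = "|".join(re.escape(name) for name in tables)
    lock = re.compile(r"^LOCK TABLES `(?:%s)` WRITE;$" % names)
    heading = re.compile(r"^(-- )(Dumping data for table `(?:%s)`)$" % names)
    skipping = False
    for line in lines:
        text = line.rstrip("\n")
        if skipping:
            # Drop everything up to and including the closing UNLOCK TABLES.
            if text == "UNLOCK TABLES;":
                skipping = False
            continue
        if lock.match(text):
            skipping = True
            continue
        # Mark the comment heading so the omission is visible in the dump.
        yield heading.sub(r"\g<1>NOT \g<2>", text)


# Made-up fragment of dump output, only to exercise the function.
DUMP = """-- Dumping data for table `wiki_objectcache`
LOCK TABLES `wiki_objectcache` WRITE;
INSERT INTO `wiki_objectcache` VALUES ('key','value');
UNLOCK TABLES;
-- Dumping data for table `wiki_page`
LOCK TABLES `wiki_page` WRITE;
INSERT INTO `wiki_page` VALUES (1,0,'Main_Page');
UNLOCK TABLES;
"""

filtered = list(strip_table_data(DUMP.splitlines(),
                                 ["wiki_objectcache", "wiki_searchindex"]))
print("\n".join(filtered))
```

In real use the lines would come from standard input rather than a string. The CREATE TABLE statements are untouched, so the emptied tables still exist after a restore.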
