Hi Alex. I added some notes below based on my experience. (I'm the
developer for XOWA (http://gnosygnu.github.io/xowa/), which generates
offline wikis from the Wikimedia XML dumps.) Feel free to follow up on-list
or off-list if you are interested. Thanks.

On Mon, Sep 21, 2015 at 3:09 PM, v0id null <v0idn...@gmail.com> wrote:

> #1: mwdumper has not been updated in a very long time. I did try to use it,
> but it did not seem to work properly. I don't entirely remember what the
> problem was but I believe it was related to schema incompatibility. xml2sql
> comes with a warning about having to rebuild links. Considering that I'm
> just in a command line and passing in page IDs manually, do I really need
> to worry about it? I'd be thrilled not to have to reinvent the wheel here.
>


> #2: Is there some way to figure it out? As I showed in a previous reply,
> the template that it can't find is there in the page table.
>
As Brion indicated, you need to strip the namespace name from the title. The
XML dump also has a "namespaces" node near the beginning. It lists every
namespace in the wiki with its name and numeric ID. You can use a rule like
"if the title starts with a namespace name and a colon, strip it". Hence, a
title like "Template:Date" starts with "Template:" and goes into the page
table with a title of just "Date" and a namespace of "10" (the namespace ID
for "Template").


> #3: Those Lua modules, are they stock modules included with the MediaWiki
> software, or something much more custom? If the latter, are they available
> to download somewhere?
>
They are the latter: wiki pages with a title starting with "Module:", so
they will be in the pages-articles.xml.bz2 dump rather than in the MediaWiki
software. You should make sure you have Scribunto set up on your wiki, or
else the wiki won't use them. See:
https://www.mediawiki.org/wiki/Extension:Scribunto
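
Once the import is done, one quick sanity check is to count how many
Module: pages actually landed in your page table (828 is the namespace ID
Scribunto registers for Module:). A minimal sketch, assuming the wiki
database is MySQL/MariaDB reached via pymysql, with placeholder connection
details:

    import pymysql  # assumption: the wiki database is MySQL/MariaDB

    # placeholder credentials -- substitute your own
    conn = pymysql.connect(host='localhost', user='wiki',
                           password='secret', db='wikidb')
    try:
        with conn.cursor() as cur:
            # 828 is the Module: namespace used by Scribunto
            cur.execute("SELECT COUNT(*) FROM page WHERE page_namespace = 828")
            print("Module: pages imported:", cur.fetchone()[0])
    finally:
        conn.close()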


> #4: I'm not an expert on MediaWiki, but it seems that the titles in the
> XML dump need to be formatted, mainly by replacing spaces with underscores.
>
Yes, surprisingly, the only change you'll need to make is to replace
spaces with underscores.
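
In other words, something as small as this covers it (trivial, but it makes
the point concrete):

    def to_db_title(dump_title):
        """Convert a dump <title> (after any namespace prefix is stripped)
        into the form stored in page.page_title."""
        return dump_title.replace(' ', '_')

    # to_db_title("List of sovereign states") -> "List_of_sovereign_states"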

Hope this helps.


> thanks for the response
> --alex
>
> On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <bvib...@wikimedia.org>
> wrote:
>
> > A few notes:
> >
> > 1) It sounds like you're recreating all the logic of importing a dump
> > into a SQL database, which may be introducing problems if you have bugs
> > in your code. For instance you may be mistakenly treating namespaces as
> > text strings instead of numbers, or failing to escape things, or missing
> > something else. I would recommend using one of the many existing tools
> > for importing a dump, such as mwdumper or xml2sql:
> >
> > https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
> >
> > 2) Make sure you've got a dump that includes the templates and Lua
> > modules etc. It sounds like either you don't have the Template: pages or
> > your import process does not handle namespaces correctly.
> >
> > 3) Make sure you've got all the necessary extensions to replicate the
> > wiki you're using a dump from, such as Lua. Many templates on Wikipedia
> > call Lua modules, and won't work without them.
> >
> > 4) Not sure what "not web friendly" means regarding titles?
> >
> > -- brion
> >
> >
> > On Mon, Sep 21, 2015 at 11:50 AM, v0id null <v0idn...@gmail.com> wrote:
> >
> > > Hello Everyone,
> > >
> > > I've been trying to write a Python script that will take an XML dump
> > > and generate all HTML, using MediaWiki itself to handle all the
> > > parsing/processing, but I've run into a problem where all the parsed
> > > output has warnings that templates couldn't be found. I'm not sure
> > > what I'm doing wrong.
> > >
> > > So I'll explain my steps:
> > >
> > > First I execute the SQL script maintenance/tables.sql
> > >
> > > Then I remove some indexes from the tables to speed up insertion.
> > >
> > > Finally I go through the XML which will execute the following insert
> > > statements:
> > >
> > >  'insert into page
> > >     (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
> > >      page_random, page_latest, page_len, page_content_model)
> > >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > >
> > > 'insert into text (old_id, old_text) values (%s, %s)'
> > >
> > > 'insert into recentchanges (rc_id, rc_timestamp, rc_user, rc_user_text,
> > >    rc_title, rc_minor, rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid,
> > >    rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len,
> > >    rc_deleted, rc_logid)
> > >    values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
> > >            %s, %s, %s)'
> > >
> > > 'insert into revision
> > >     (rev_id, rev_page, rev_text_id, rev_user, rev_user_text, rev_timestamp,
> > >      rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1)
> > >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > >
> > > All IDs from the XML dump are kept. I noticed that the titles are not
> > > web friendly. Thinking this was the problem, I ran the
> > > maintenance/cleanupTitles.php script but it didn't seem to fix anything.
> > >
> > > Doing this, I can now run the following PHP script:
> > >     $id = 'some revision id';
> > >     $rev = Revision::newFromId( $id );
> > >     $titleObj = $rev->getTitle();
> > >     $pageObj = WikiPage::factory( $titleObj );
> > >
> > >     $context = RequestContext::newExtraneousContext($titleObj);
> > >
> > >     $popts = ParserOptions::newFromContext($context);
> > >     $pout = $pageObj->getParserOutput($popts);
> > >
> > >     var_dump($pout);
> > >
> > > The mText property of $pout contains the parsed output, but it is full
> > > of stuff like this:
> > >
> > > <a href="/index.php?title=Template:Date&action=edit&redlink=1"
> > >    class="new" title="Template:Date (page does not exist)">Template:Date</a>
> > >
> > >
> > > I feel like I'm missing a step here. I tried importing the
> > > templatelinks SQL dump, but it also did not fix anything. It also did
> > > not include any header or footer which would be useful.
> > >
> > > Any insight or help is much appreciated, thank you.
> > >
> > > --alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
