A few notes:

1) It sounds like you're recreating all the logic of importing a dump into
a SQL database, which may be introducing problems if you have bugs in your
code. For instance you may be mistakenly treating namespaces as text
strings instead of numbers, or failing to escape things, or missing
something else. I would recommend using one of the many existing tools for
importing a dump, such as mwdumper or xml2sql:

https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
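
For example, a typical mwdumper run looks roughly like this (a sketch,
assuming a MySQL database 'wikidb' already initialized with
maintenance/tables.sql):

    java -jar mwdumper.jar --format=sql:1.5 pages-articles.xml.bz2 \
        | mysql -u wikiuser -p wikidb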

2) Make sure you've got a dump that includes the templates, Lua modules,
etc. It sounds like either you don't have the Template: pages or your
import process does not handle namespaces correctly.
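
Namespaces are stored as numbers, not prefixes: "Template:Date" should end
up as page_namespace = 10 with page_title = 'Date' (Template: is namespace
10, Module: is 828). A minimal Python sketch, assuming a hypothetical
namespaces map built from the <namespaces> block in the dump's <siteinfo>
header:

    # namespaces is assumed to look like {'Template': 10, 'Module': 828, ...}
    def split_title(full_title, namespaces):
        """Split a prefixed title into (namespace number, DB key)."""
        if ':' in full_title:
            prefix, rest = full_title.split(':', 1)
            if prefix in namespaces:
                return namespaces[prefix], rest.replace(' ', '_')
        # No recognized prefix: main namespace (0).
        return 0, full_title.replace(' ', '_')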

3) Make sure you've got all the necessary extensions to replicate the wiki
you're using a dump from, in particular Scribunto for Lua support. Many
templates on Wikipedia call Lua modules, and won't work without it.
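
For instance, with Scribunto installed, LocalSettings.php needs something
along these lines (the usual setup for MediaWiki releases of this era;
adjust paths to your install):

    require_once "$IP/extensions/Scribunto/Scribunto.php";
    $wgScribuntoDefaultEngine = 'luastandalone';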

4) Not sure what "not web friendly" means regarding titles?

-- brion


On Mon, Sep 21, 2015 at 11:50 AM, v0id null <v0idn...@gmail.com> wrote:

> Hello Everyone,
>
> I've been trying to write a Python script that will take an XML dump and
> generate all the HTML, using MediaWiki itself to handle all the
> parsing/processing, but I've run into a problem where all of the parsed
> output has warnings that templates couldn't be found. I'm not sure what
> I'm doing wrong.
>
> So I'll explain my steps:
>
> First I execute the SQL script maintenance/tables.sql.
>
> Then I remove some indexes from the tables to speed up insertion.
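>
> For example, one of the indexes I drop (a sketch; they get re-added once
> the import finishes):
>
>     ALTER TABLE page DROP INDEX name_title;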
>
> Finally, I go through the XML and execute the following insert
> statements:
>
>  'insert into page
>     (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
>      page_random, page_latest, page_len, page_content_model)
>     values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
>
> 'insert into text (old_id, old_text) values (%s, %s)'
>
> 'insert into recentchanges
>    (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor,
>     rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
>     rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
>    values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
>            %s, %s, %s)'
>
> 'insert into revision
>    (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
>     rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id,
>     rev_sha1)
>    values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
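>
> (A rough sketch of how these get executed, assuming the MySQLdb driver;
> the %s placeholders are bound by the driver rather than interpolated into
> the string, so escaping is handled there:)
>
>     import MySQLdb
>
>     db = MySQLdb.connect(db='wikidb', user='wiki', passwd='secret',
>                          charset='utf8')
>     cur = db.cursor()
>     # page_rows is a list of 9-tuples matching the page insert above
>     cur.executemany(page_sql, page_rows)
>     db.commit()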
>
> All IDs from the XML dump are kept. I noticed that the titles are not web
> friendly. Thinking this was the problem, I ran the
> maintenance/cleanupTitles.php script, but it didn't seem to fix anything.
>
> Doing this, I can now run the following PHP script:
>     $id = 'some revision id';
>     $rev = Revision::newFromId( $id );
>     $titleObj = $rev->getTitle();
>     $pageObj = WikiPage::factory( $titleObj );
>
>     $context = RequestContext::newExtraneousContext($titleObj);
>
>     $popts = ParserOptions::newFromContext($context);
>     $pout = $pageObj->getParserOutput($popts);
>
>     var_dump($pout);
>
> The mText property of $pout contains the parsed output, but it is full of
> stuff like this:
>
> <a href="/index.php?title=Template:Date&action=edit&redlink=1" class="new"
> title="Template:Date (page does not exist)">Template:Date</a>
>
>
> I feel like I'm missing a step here. I tried importing the templatelinks
> SQL dump, but that did not fix anything either. The parsed output also
> does not include any page header or footer, which would be useful.
>
> Any insight or help is much appreciated, thank you.
>
> --alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
