http://dumps.wikimedia.org/enwikinews/latest/enwikinews-latest-pages-articles.xml.bz2
This one. I believe it contains the latest revision of every page. I do see
template pages in there; at least, there are pages whose titles have the
format Template:[some template name].

On Mon, Sep 21, 2015 at 2:53 PM, John <phoenixoverr...@gmail.com> wrote:
> What kind of dump are you working from?
>
> On Mon, Sep 21, 2015 at 2:50 PM, v0id null <v0idn...@gmail.com> wrote:
>> Hello everyone,
>>
>> I've been trying to write a Python script that takes an XML dump and
>> generates all the HTML, using MediaWiki itself to handle all the
>> parsing/processing, but I've run into a problem: all of the parsed
>> output contains warnings that templates couldn't be found. I'm not sure
>> what I'm doing wrong.
>>
>> So I'll explain my steps:
>>
>> First I execute the SQL script maintenance/tables.sql.
>>
>> Then I remove some indexes from the tables to speed up insertion.
>>
>> Finally I go through the XML, which executes the following insert
>> statements:
>>
>>     insert into page
>>         (page_id, page_namespace, page_title, page_is_redirect,
>>          page_is_new, page_random, page_latest, page_len,
>>          page_content_model)
>>     values (%s, %s, %s, %s, %s, %s, %s, %s, %s)
>>
>>     insert into text (old_id, old_text) values (%s, %s)
>>
>>     insert into recentchanges
>>         (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor,
>>          rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type,
>>          rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len,
>>          rc_deleted, rc_logid)
>>     values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
>>             %s, %s, %s, %s)
>>
>>     insert into revision
>>         (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
>>          rev_timestamp, rev_minor_edit, rev_deleted, rev_len,
>>          rev_parent_id, rev_sha1)
>>     values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
>>
>> All IDs from the XML dump are kept. I noticed that the titles are not
>> web friendly.
>> Thinking this was the problem, I ran the maintenance/cleanupTitles.php
>> script, but it didn't seem to fix anything.
>>
>> Having done this, I can now run the following PHP script:
>>
>>     $id = 'some revision id';
>>     $rev = Revision::newFromId( $id );
>>     $titleObj = $rev->getTitle();
>>     $pageObj = WikiPage::factory( $titleObj );
>>
>>     $context = RequestContext::newExtraneousContext( $titleObj );
>>
>>     $popts = ParserOptions::newFromContext( $context );
>>     $pout = $pageObj->getParserOutput( $popts );
>>
>>     var_dump( $pout );
>>
>> The mText property of $pout contains the parsed output, but it is full
>> of stuff like this:
>>
>>     <a href="/index.php?title=Template:Date&action=edit&redlink=1"
>>        class="new" title="Template:Date (page does not exist)">Template:Date</a>
>>
>> I feel like I'm missing a step here. I tried importing the
>> templatelinks SQL dump, but it also did not fix anything. The output
>> also did not include any header or footer, which would be useful.
>>
>> Any insight or help is much appreciated. Thank you.
>>
>> --alex

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
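For reference, the import loop described in the quoted message (create the tables, then stream the dump into page/text inserts) can be sketched roughly as below. This is a minimal sketch, not the poster's actual script: it uses Python's xml.etree.ElementTree.iterparse, sqlite3 stands in for MySQL, the schema is trimmed to three/two columns to show the flow, and the inline one-page "dump" and export-namespace version are illustrative.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Export-schema namespace; the version suffix varies by dump, so on a
# real file read it from the <mediawiki> root element instead.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Tiny inline stand-in for the real .xml.bz2 dump, so the sketch runs.
DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Template:Date</title>
    <ns>10</ns>
    <id>7</id>
    <revision><id>42</id><text>{{CURRENTDAY}}</text></revision>
  </page>
</mediawiki>"""

# sqlite3 stands in for the MySQL DB created by maintenance/tables.sql;
# columns are trimmed here for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (page_id, page_namespace, page_title)")
db.execute("CREATE TABLE text (old_id, old_text)")

# Stream with iterparse so a multi-gigabyte dump never has to fit in
# memory; each <page> is inserted and then cleared.
for _, elem in ET.iterparse(io.StringIO(DUMP), events=("end",)):
    if elem.tag == NS + "page":
        page_id = int(elem.findtext(NS + "id"))
        ns = int(elem.findtext(NS + "ns"))
        title = elem.findtext(NS + "title")
        rev = elem.find(NS + "revision")
        db.execute("INSERT INTO page VALUES (?, ?, ?)", (page_id, ns, title))
        db.execute("INSERT INTO text VALUES (?, ?)",
                   (int(rev.findtext(NS + "id")), rev.findtext(NS + "text")))
        elem.clear()

print(db.execute("SELECT page_title FROM page").fetchone()[0])
```

On a real dump you would open the file with bz2.open and batch the inserts with executemany rather than inserting row by row.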
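One thing worth checking, given the "titles are not web friendly" observation and the red Template: links: MediaWiki stores page_title without the namespace prefix (the namespace goes into page_namespace as a number, 10 for Template:) and with underscores in place of spaces. If the dump's <title> values were inserted verbatim, a lookup for the template "Date" in namespace 10 would find nothing. A hedged sketch of the conversion (the helper name is mine, not MediaWiki's):

```python
def to_db_title(dump_title: str, ns: int) -> str:
    """Convert a dump <title> like 'Template:Some date' to the form
    MediaWiki expects in page.page_title for that namespace."""
    if ns != 0 and ":" in dump_title:
        # Drop the 'Template:' (etc.) prefix; it is encoded in page_namespace.
        dump_title = dump_title.split(":", 1)[1]
    # Spaces are stored as underscores.
    return dump_title.replace(" ", "_")

print(to_db_title("Template:Some date", 10))  # Some_date
```

This is a sketch of the general rule only; real title normalization has more cases (first-letter capitalization, interwiki prefixes), which is what maintenance/cleanupTitles.php and the Title class handle.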