Thanks for the input, everyone. I was not aware that importing the XML dumps
was so involved.

In the end I used xml2sql. It needed two patches and a bit more work on my
end to get it working, and I also had to strip the <DiscussionThreading> tag
out of the XML dump, but it is very fast.

For those wondering, I'm toying around with an automated news categorizer
and wanted to use Wikinews as a corpus. It's not perfect, but this is just
hobbyist-level stuff. I'm using NLTK, so I wanted to keep things
Python-centric, but I've written a PHP script that runs as a simple TCP
server which my Python script can connect to and ask for the HTML output.
The Python script downloads MediaWiki and the right XML dump, unzips
everything, sets up LocalSettings.php, compiles xml2sql, runs it, and then
imports the resulting SQL into the database. So it essentially automates
making an offline installation of (I assume) any MediaWiki XML dump. It then
starts that simple PHP server (using plain sockets), and I just send it page
IDs and it responds with the fully rendered HTML, including headers and
footers.
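
Roughly, the client side of that looks something like this (just a sketch;
the port number and the newline-terminated framing here are placeholders for
whatever the PHP script actually implements):

    import socket

    def fetch_html(page_id, host="127.0.0.1", port=8001):
        """Ask the PHP rendering server for the HTML of one page ID."""
        with socket.create_connection((host, port)) as sock:
            sock.sendall((str(page_id) + "\n").encode("utf-8"))
            chunks = []
            while True:
                data = sock.recv(65536)
                if not data:  # server closes the connection when done
                    break
                chunks.append(data)
        return b"".join(chunks).decode("utf-8")

    html = fetch_html(12345)  # hypothetical page ID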

I figure that with this approach I can run a few forks on the Python and
PHP sides to speed up the process.
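
Something like multiprocessing.Pool should cover the Python side; a rough
sketch, reusing the fetch_html helper from the sketch above and assuming the
PHP server accepts concurrent connections:

    from multiprocessing import Pool

    def render(page_id):
        # fetch_html() is the helper from the earlier sketch
        return page_id, fetch_html(page_id)

    if __name__ == "__main__":
        page_ids = [1, 2, 3, 4]  # in practice, read these from the page table
        with Pool(processes=4) as pool:
            pages = dict(pool.imap_unordered(render, page_ids))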

Then I use Python to parse through the HTML and pull out whatever I need
from the page, which for now is the categories and the article content, and
use that to train classifiers from NLTK.
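
For the scraping step, something like this is what I have in mind (just a
sketch; the element ids "mw-content-text" and "catlinks" are the usual
MediaWiki ones, but check the HTML your install actually produces):

    from bs4 import BeautifulSoup
    import nltk  # nltk.download('punkt') may be needed once for word_tokenize

    def extract(html):
        """Pull the article text and its category names out of rendered HTML."""
        soup = BeautifulSoup(html, "html.parser")
        body = soup.find(id="mw-content-text")
        text = body.get_text(" ", strip=True) if body else ""
        cats = [a.get_text() for a in soup.select("#catlinks a")
                if a.get("title", "").startswith("Category:")]
        return text, cats

    def features(text):
        # Crude bag-of-words features; good enough for hobbyist experiments
        return {word.lower(): True for word in nltk.word_tokenize(text)}

    # train_set = [(features(text), category), ...] built from many pages
    # classifier = nltk.NaiveBayesClassifier.train(train_set)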

Maybe not the easiest approach, but it does make things easy to use. I've
looked at the Python wikitext parsers, but none of them seem like they will
be as successful or as correct as using MediaWiki itself.

---alex

On Tue, Sep 22, 2015 at 11:09 PM, gnosygnu <gnosy...@gmail.com> wrote:

> Hi alex. I added some notes below based on my experience. (I'm the
> developer of XOWA (http://gnosygnu.github.io/xowa/), which generates
> offline wikis from the Wikimedia XML dumps.) Feel free to follow up
> on-list or off-list if you are interested. Thanks.
>
> On Mon, Sep 21, 2015 at 3:09 PM, v0id null <v0idn...@gmail.com> wrote:
>
> > #1: mwdumper has not been updated in a very long time. I did try to use
> > it, but it did not seem to work properly. I don't entirely remember what
> > the problem was but I believe it was related to schema incompatibility.
> > xml2sql comes with a warning about having to rebuild links. Considering
> > that I'm just in a command line and passing in page IDs manually, do I
> > really need to worry about it? I'd be thrilled not to have to reinvent
> > the wheel here.
> >
>
>
> > #2: Is there some way to figure it out? As I showed in a previous reply,
> > the template that it can't find is there in the page table.
> >
> As brion indicated, you need to strip the namespace name. The XML dump
> also has a "namespaces" node near the beginning. It lists every namespace
> in the wiki with "name" and "ID". You can use a rule like "if the title
> starts with a namespace and a colon, strip it". Hence, a title like
> "Template:Date" starts with "Template:" and goes into the page table with a
> title of just "Date" and a namespace of "10" (the namespace ID for
> "Template").
>
>
> > #3: Those lua modules, are they stock modules included with the mediawiki
> > software, or something much more custom? If the latter, are they
> > available to download somewhere?
> >
> Yes, these are articles with a title starting with "Module:". They will be
> in the pages-articles.xml.bz2 dump. You should make sure you have Scribunto
> set up on your wiki, or else it won't use them. See:
> https://www.mediawiki.org/wiki/Extension:Scribunto
>
>
> > #4: I'm not any expert on mediawiki, but it seems that the titles in
> > the xml dump need to be formatted, mainly replacing spaces with
> > underscores.
> >
> Yes, surprisingly, the only change you'll need to make is to replace
> spaces with underscores.
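>
> For example, on the Python side (just a sketch):
>
>     db_title = title.replace(" ", "_")  # "Main Page" -> "Main_Page"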
>
> Hope this helps.
>
>
> > thanks for the response
> > --alex
> >
> > On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <bvib...@wikimedia.org>
> > wrote:
> >
> > > A few notes:
> > >
> > > 1) It sounds like you're recreating all the logic of importing a dump
> > > into a SQL database, which may be introducing problems if you have bugs
> > > in your code. For instance you may be mistakenly treating namespaces as
> > > text strings instead of numbers, or failing to escape things, or
> > > missing something else. I would recommend using one of the many
> > > existing tools for importing a dump, such as mwdumper or xml2sql:
> > >
> > > https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
> > >
> > > 2) Make sure you've got a dump that includes the templates and lua
> > > modules etc. It sounds like either you don't have the Template: pages
> > > or your import process does not handle namespaces correctly.
> > >
> > > 3) Make sure you've got all the necessary extensions to replicate the
> > > wiki you're using a dump from, such as Lua. Many templates on Wikipedia
> > > call Lua modules, and won't work without them.
> > >
> > > 4) Not sure what "not web friendly" means regarding titles?
> > >
> > > -- brion
> > >
> > >
> > > On Mon, Sep 21, 2015 at 11:50 AM, v0id null <v0idn...@gmail.com>
> > > wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > I've been trying to write a python script that will take an XML
> > > > dump and generate all the HTML, using Mediawiki itself to handle all
> > > > the parsing/processing, but I've run into a problem where all the
> > > > parsed output has warnings that templates couldn't be found. I'm not
> > > > sure what I'm doing wrong.
> > > >
> > > > So I'll explain my steps:
> > > >
> > > > First I execute the SQL script maintenance/tables.sql
> > > >
> > > > Then I remove some indexes from the tables to speed up insertion.
> > > >
> > > > Finally I go through the XML which will execute the following insert
> > > > statements:
> > > >
> > > > 'insert into page
> > > >    (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
> > > >     page_random, page_latest, page_len, page_content_model)
> > > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > > >
> > > > 'insert into text (old_id, old_text) values (%s, %s)'
> > > >
> > > > 'insert into recentchanges
> > > >    (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor,
> > > >     rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
> > > >     rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
> > > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
> > > >          %s, %s)'
> > > >
> > > > 'insert into revision
> > > >    (rev_id, rev_page, rev_text_id, rev_user, rev_user_text, rev_timestamp,
> > > >     rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1)
> > > >  values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > > >
> > > > All IDs from the XML dump are kept. I noticed that the titles are
> > > > not web friendly. Thinking this was the problem, I ran the
> > > > maintenance/cleanupTitles.php script, but it didn't seem to fix
> > > > anything.
> > > > Doing this, I can now run the following PHP script:
> > > >     $id = 'some revision id';
> > > >     $rev = Revision::newFromId( $id );
> > > >     $titleObj = $rev->getTitle();
> > > >     $pageObj = WikiPage::factory( $titleObj );
> > > >
> > > >     // Build parser options from a fresh request context for this title
> > > >     $context = RequestContext::newExtraneousContext( $titleObj );
> > > >     $popts = ParserOptions::newFromContext( $context );
> > > >
> > > >     // Parse the page and dump the resulting ParserOutput
> > > >     $pout = $pageObj->getParserOutput( $popts );
> > > >     var_dump( $pout );
> > > >
> > > > The mText property of $pout contains the parsed output, but it is
> > > > full of stuff like this:
> > > >
> > > > <a href="/index.php?title=Template:Date&action=edit&redlink=1"
> > > > class="new" title="Template:Date (page does not exist)">Template:Date</a>
> > > >
> > > >
> > > > I feel like I'm missing a step here. I tried importing the
> > > > templatelinks SQL dump, but it also did not fix anything. It also did
> > > > not include any header or footer, which would be useful.
> > > >
> > > > Any insight or help is much appreciated, thank you.
> > > >
> > > > --alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
