Hi Eric
 
The best place to look is probably 
http://meta.wikimedia.org/wiki/Alternative_parsers 
 
I'm guessing the "non-parser dumper", which uses MediaWiki's internal code to 
do the rendering, might be a good choice.
 
regards
Dave Pattern
University of Huddersfield
 

________________________________

From: Code for Libraries on behalf of Eric Lease Morgan
Sent: Sun 10/09/2006 14:28
To: [email protected]
Subject: [CODE4LIB] munging wikimedia



How do I go about munging wikimedia content?

After realizing that downloadable data dumps of Wikipedia are sorted
by language code, I was able to acquire the 1.6 GB compressed data,
uncompress it, parse it with Parse::MediaWikiDump, and output things
like article title and article text.
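
(A minimal sketch of that step, assuming the uncompressed dump is saved as 
enwiki-pages-articles.xml -- the filename is just a placeholder -- might look 
like the following; Parse::MediaWikiDump's text() returns a reference to the 
raw wikitext.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    # walk the uncompressed pages-articles dump and print each
    # article's title and raw wikitext
    my $pages = Parse::MediaWikiDump::Pages->new('enwiki-pages-articles.xml');
    while ( defined( my $page = $pages->next ) ) {
        my $text = $page->text;    # reference to the article's wikitext
        print $page->title, "\n";
        print $$text, "\n\n";
    }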

The text contains all sorts of wikimedia mark-up: [[]], \\, #, ==, *,
etc. I suppose someone has already written something that converts
this markup into HTML and/or plain text, but I can't find anything.
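
(One crude first pass, assuming regular expressions are good enough for 
indexing purposes, is to strip the most common constructs as below; a real 
parser from the Alternative_parsers page will handle templates, tables and 
nesting far better than this sketch does.)

    # rough sketch only: knock the most common wikitext constructs
    # down to something close to plain text
    sub strip_wikitext {
        my ($text) = @_;

        $text =~ s/\{\{[^{}]*\}\}//gs;                # simple {{templates}}
        $text =~ s/\[\[[^|\]]*\|([^\]]*)\]\]/$1/g;    # [[target|label]] -> label
        $text =~ s/\[\[([^\]]*)\]\]/$1/g;             # [[target]] -> target
        $text =~ s/'{2,}//g;                          # bold/italic quote marks
        $text =~ s/^=+\s*(.*?)\s*=+\s*$/$1/mg;        # == heading == -> heading
        $text =~ s/^[*#:;]+\s*//mg;                   # list and indent markers
        $text =~ s/<[^>]+>//g;                        # stray HTML tags

        return $text;
    }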

If you were to get the Wikipedia content, cache it locally, index it,
and provide access to the index, then how would you deal with the
Wiki mark-up?

--
Eric Lease Morgan
University Libraries of Notre Dame



