On 27/01/2011 14:35, Luigi Assom wrote:
> Here is another question, on a different topic:
>
> we would like to examine the network properties of the wiki.
> There are already some results here and there, but we would like to
> take a closer look at them, to eventually improve the knowledge base.
>
> To do that, we need to access the wiki's pages (only articles for now),
> with article name, abstract, meta keys, the internal hyperlinks connecting
> them, and the external hyperlinks.
>
> We found the database dumps in gz format, but they are very large files,
> hence my question:
> How can we manipulate them with phpMyAdmin?
> Is there any other open source tool for handling data files of this size?
>
> An easy way to get first results would be to have the database of articles
> with the above parameters as an XML file.
> Even a portion of it would be interesting for a demo project to work on.
>

Hi Luigi,
there are various tools for reading the XML dump files and importing them
into MySQL, which is probably the best option if you want to handle very
large files like the dumps of the English Wikipedia. See here:
http://meta.wikimedia.org/wiki/Data_dumps#Tools
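
If you go the dump route, the main trick is to stream the XML rather than
load it all at once. As a rough sketch (not one of the tools listed on that
page, and with a placeholder file name), something like this in Python
iterates over the <page> elements of a pages-articles dump:

    import bz2
    import xml.etree.ElementTree as ET

    # Placeholder file name; use the dump file you actually downloaded.
    DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"

    def iter_pages(path):
        """Yield (title, wikitext) pairs without loading the whole dump."""
        with bz2.open(path, "rb") as f:
            for _, elem in ET.iterparse(f, events=("end",)):
                # Dump elements carry a schema namespace, e.g.
                # {http://www.mediawiki.org/xml/export-0.10/}page
                if elem.tag.endswith("}page") or elem.tag == "page":
                    ns = elem.tag[:-len("page")]
                    title = elem.findtext(ns + "title")
                    text = elem.findtext(ns + "revision/" + ns + "text") or ""
                    yield title, text
                    elem.clear()  # free elements already processed

    for i, (title, text) in enumerate(iter_pages(DUMP_PATH)):
        print(title, len(text))
        if i >= 4:
            break

From there you can extract the internal links from the wikitext (the
[[...]] targets) and load whatever fields you need into MySQL.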

If you're only interested in a subset of the articles, and just in their
current revisions, another possibility is crawling the site via the
MediaWiki API: http://www.mediawiki.org/wiki/API
There are several client libraries; a Google query for your favourite
language should return some pointers.
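
For example, fetching the internal links of one article is a single API
request (plain urllib here just to show the shape of the call; the article
title and User-Agent string are only examples, and a real crawler should
follow the "continue" parameter for pages with many links):

    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def get_links(title):
        """Return the namespace-0 links of one article via action=query."""
        params = {
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "links",
            "plnamespace": 0,
            "pllimit": "max",
        }
        url = API + "?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(url, headers={"User-Agent": "wiki-graph-demo/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        return [link["title"] for link in page.get("links", [])]

    print(get_links("Network science")[:10])

The same query form also gives you external links (prop=extlinks), and a
client library will handle continuation and rate limiting for you.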

-- 
Giovanni L. Ciampaglia
PhD Student
University of Lugano, MACS Lab

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
