Re: [Wiki-research-l] "Quick" request

Marco Fossati Tue, 23 Feb 2016 01:11:21 -0800

Hi Bruno,

I have been using the WikiExtractor for this task:
https://github.com/attardi/wikiextractor
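
For reference, a minimal Python sketch of reading the extractor's output afterwards. It assumes the tool's default plain-text layout, in which each article is wrapped in a <doc id="..." url="..." title="..."> ... </doc> block. The helper name iter_docs and the sample text are hypothetical, made up for illustration, and are not part of WikiExtractor itself.

```python
import re
from html import unescape

# Assumed layout of the extracted files: one <doc ...> ... </doc> block per
# article, with id, url and title attributes on the opening line.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)\n?</doc>',
    re.DOTALL,
)


def iter_docs(raw):
    """Yield (id, title, text) for every <doc> block found in raw."""
    for m in DOC_RE.finditer(raw):
        yield m.group("id"), unescape(m.group("title")), m.group("text").strip()


# Hypothetical sample standing in for the contents of one extracted file.
sample = (
    '<doc id="1" url="https://example.org/?curid=1" title="First page">\n'
    'First page\n\nSome text.\n'
    '</doc>\n'
    '<doc id="2" url="https://example.org/?curid=2" title="Second page">\n'
    'Second page\n\nMore text.\n'
    '</doc>\n'
)

docs = list(iter_docs(sample))
for doc_id, title, text in docs:
    print(doc_id, title, len(text))
```

In practice the string passed to iter_docs would be the contents of one of the files the extractor writes to its output directory. If the output was produced in another format (for example JSON), this pattern does not apply.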


Hope this helps.
Cheers,

Marco

On 2/22/16 23:32, wiki-research-l-requ...@lists.wikimedia.org wrote:

Date: Mon, 22 Feb 2016 23:12:08 +0100
From: "Federico Leva (Nemo)"<nemow...@gmail.com>
To: Research into Wikimedia content and communities
        <wiki-research-l@lists.wikimedia.org>
Subject: Re: [Wiki-research-l] "Quick" request
Message-ID:<56cb87b8.9050...@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed

Bruno Goncalves, 22/02/2016 22:58:

> There used to be official HTML dumps
> https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't
> been updated in almost a decade :)

The job is effectively done by Kiwix now.
http://download.kiwix.org/zim/wikipedia/
For instance:
    wikipedia_en_all_nopic_2015-05.zim        17-May-2015 10:27   15G

There are several tools to extract the HTML from a ZIM file:
http://www.openzim.org/wiki/Readers

Nemo


_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
