Hi Bruno,
I have been using WikiExtractor for this task:
https://github.com/attardi/wikiextractor
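In case it is useful, here is a minimal sketch of reading the extractor's output back in Python. It assumes the default (non-JSON) output, where each article is wrapped in a <doc id="..." url="..." title="..."> ... </doc> element. The names Doc and iter_docs and the sample text are made up for illustration and are not part of WikiExtractor.

```python
import re
from typing import Iterator, NamedTuple


class Doc(NamedTuple):
    """One extracted article (hypothetical helper type, not a WikiExtractor class)."""
    doc_id: str
    url: str
    title: str
    text: str


# Assumption: articles look like
#   <doc id="12" url="..." title="Anarchism">
#   ...plain text...
#   </doc>
_DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)\n?</doc>',
    re.DOTALL,
)


def iter_docs(extracted: str) -> Iterator[Doc]:
    """Yield one Doc per <doc ...> ... </doc> block found in the string."""
    for m in _DOC_RE.finditer(extracted):
        yield Doc(m["id"], m["url"], m["title"], m["text"].strip())


# Made-up sample in the assumed format, standing in for one output file.
sample = (
    '<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">\n'
    'Anarchism\n\nAnarchism is a political philosophy.\n'
    '</doc>\n'
    '<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">\n'
    'Autism\n\nAutism is a neurodevelopmental condition.\n'
    '</doc>\n'
)

docs = list(iter_docs(sample))
for doc in docs:
    print(doc.doc_id, doc.title, len(doc.text))
```

For real output you would read each extracted file into a string and pass it to iter_docs; a streaming parser would be a better fit for very large files.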
Hope this helps.
Cheers,
Marco
On 2/22/16 23:32, wiki-research-l-requ...@lists.wikimedia.org wrote:
Date: Mon, 22 Feb 2016 23:12:08 +0100
From: "Federico Leva (Nemo)" <nemow...@gmail.com>
To: Research into Wikimedia content and communities
<wiki-research-l@lists.wikimedia.org>
Subject: Re: [Wiki-research-l] "Quick" request
Bruno Goncalves, 22/02/2016 22:58:
> There used to be official HTML dumps
> https://dumps.wikimedia.org/other/static_html_dumps/ but they haven't
> been updated in almost a decade :)
The job is effectively done by Kiwix now.
http://download.kiwix.org/zim/wikipedia/
For instance:
wikipedia_en_all_nopic_2015-05.zim 17-May-2015 10:27 15G
There are several tools to extract the HTML from a ZIM file:
http://www.openzim.org/wiki/Readers
Nemo