[Xmldatadumps-l] Re: "Experimental" Status of Enterprise HTML Dumps

Jan Berkel Mon, 08 May 2023 01:22:38 -0700

On Fri, 5 May 2023, at 22:53, Evan Lloyd New-Schmidt wrote:
> Hi, I'm starting a project that will involve repeated processing of HTML 
> wikipedia articles.
>
> Using the enterprise dumps seems like it would be much simpler than 
> converting the XML dumps, but I don't know what the "experimental" 
> status really means.


Hi,

From my experience working with the Wiktionary HTML dumps I can say that the 
data quality is quite poor: there are stale and missing entries 
(https://phabricator.wikimedia.org/T305407). 

There are also entire namespaces excluded from the dumps, and more recently 
there have been issues with the dumps not getting updated.

So it depends on what kind of processing you need to do. In general I find the
HTML dumps much easier to parse; hopefully they'll manage to sort out the
problems.

 Jan
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
