As described on Phabricator, a bug [1] has surfaced in which the
"pages-articles" XML dumps on https://dumps.wikimedia.org/ contain
incomplete records.

A possible fix has been identified, and it involves bumping the dump schema
version from version 0.10 to version 0.11 [2], which could be a breaking
change for some.

MORE DETAILS:

Because of this bug, a nontrivial number of <text> nodes that should
contain article text appear empty, like so:

<text bytes="123456789" />
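
If you want to check whether a dump you already downloaded is affected, a
rough sketch along these lines may help. It assumes a full-content dump
(not a stub dump), that the main <text> node is a direct child of
<revision>, and that an empty node with a nonzero "bytes" attribute is the
symptom to look for. The function name and the sample document are
illustrative only and are not part of any Wikimedia tooling.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so the check works for any schema version."""
    return tag.rsplit("}", 1)[-1]


def find_empty_text_revisions(stream):
    """Yield (page_title, revision_id, declared_bytes) for empty <text> nodes
    whose "bytes" attribute claims that content should be present."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id = None
            text = None
            # Only direct children: <contributor><id> is deliberately skipped.
            for child in elem:
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = child.text
                elif child_name == "text":
                    text = child
            if text is not None:
                declared = int(text.get("bytes", "0"))
                if declared > 0 and not (text.text or ""):
                    yield title, rev_id, declared
            elem.clear()  # free memory; dumps are large
        elif name == "page":
            elem.clear()


# Tiny hypothetical sample, not taken from a real dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10">
  <page>
    <title>Complete</title>
    <revision>
      <id>1</id>
      <text bytes="5">Hello</text>
    </revision>
  </page>
  <page>
    <title>Affected</title>
    <revision>
      <id>2</id>
      <text bytes="123456789" />
    </revision>
  </page>
</mediawiki>"""

for hit in find_empty_text_revisions(io.BytesIO(SAMPLE)):
    print(hit)
```

For a real dump, the same function could be fed a decompressed stream, for
example one opened with bz2.open(path, "rb"). Revisions whose text was
deliberately suppressed may also show up, so treat the output as a list of
candidates rather than a definitive count.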

A potential fix in T365155 [3] has been identified. Assuming further
testing looks good, XML dumps will be kicked off again starting next week
in order to restore the missing records as soon as possible. It will take a
while for new dumps to be generated, as this is a compute-intensive
operation. Further progress will be reported at T365155, and new dumps will
eventually show up on dumps.wikimedia.org.

Although a number of pipelines may not notice the change associated with
the schema bump, if your dump ingestion tooling or use of Special:Export
relies on the specific shape of the XML at version 0.10 (e.g., because of
code generation tools), please examine the differences between version 0.10
and version 0.11. One notable change in version 0.11 is the addition of MCR
[4] fields.
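
For pipelines that need to handle both versions during the transition, one
option is to read the schema version from the root element before choosing
a parser. The sketch below assumes the root <mediawiki> element carries the
export namespace and a "version" attribute, as the published XSDs [2]
describe; the function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def dump_schema_version(stream):
    """Return (namespace_uri, version_attribute) read from the root element."""
    for _event, elem in ET.iterparse(stream, events=("start",)):
        tag = elem.tag
        # ElementTree spells namespaced tags as '{uri}localname'.
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        # The first "start" event is the root, so stop reading here.
        return namespace, elem.get("version")
    raise ValueError("no root element found")


# Hypothetical header of a dump written with the newer schema.
HEADER = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" '
    b'version="0.11"><siteinfo/></mediawiki>'
)

namespace, version = dump_schema_version(io.BytesIO(HEADER))
print(namespace, version)
```

Code that hard-codes the 0.10 namespace URI in its element lookups is the
most likely to break, so matching on local element names or selecting the
namespace from this value may be the smaller change.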

Thank you for your patience while this issue is resolved.

-Adam

[1]
https://phabricator.wikimedia.org/T365501

[2]
https://www.mediawiki.org/xml/export-0.10.xsd

and

https://www.mediawiki.org/xml/export-0.11.xsd

Schema version 0.11 has existed in MediaWiki for over 6 years, but
Wikimedia wikis have been using version 0.10.

[3]
https://phabricator.wikimedia.org/T365155#9851025

and

https://phabricator.wikimedia.org/T365155#9851160

[4]
https://www.mediawiki.org/wiki/Multi-Content_Revisions
