There's an ongoing discussion in ops about improving the dump process, see

 https://phabricator.wikimedia.org/T88728
 https://phabricator.wikimedia.org/T93396
 https://phabricator.wikimedia.org/T17017

 https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog/Improve_dumps

I would like to join in and add our requirements and thoughts to the list, and
would like some input on that. So far I have:

Make it easier to register a new type of dump via a config change.
A dump should define:
* the script(s) to run
* output file(s)
* the dump schedule
* a short name
* a brief description (wikitext or HTML? translatable?)
* required input files (maybe)
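
To make this more concrete, here is a rough sketch of what such a
registration entry could carry. The field names and values below are made up
and just mirror the list above, they are not a proposal for the actual format:

    # Hypothetical registration entry for a new dump type; field names
    # and values are illustrative only, not an actual config format.
    WIKIDATA_JSON_DUMP = {
        "name": "wikidata-json",                     # short name
        "description": "All entities as JSON",       # brief description
        "scripts": ["dumpwikidatajson.sh"],          # script(s) to run
        "outputs": ["wikidata-{date}-all.json.gz"],  # output file(s)
        "schedule": "weekly",                        # dump schedule
        "inputs": [],                                # required input files
    }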

Maintain clear timelines of consistent dumps:
* drop the misleading "one directory with one timestamp for all dumps" approach
* have one timeline per dump instead
* for dumps that are guaranteed to be consistent (one generated from the other),
generate a timeline of directories with symlinks to the actual files.
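
To illustrate the symlink idea (all paths and naming conventions below are
made up), generating such a "consistent set" directory could be as simple as:

    # Sketch only: publish a "consistent set" directory that contains
    # nothing but symlinks into the per-dump timelines.
    import os

    def publish_consistent_set(base, date, members):
        """members maps a dump name to the file (in its own timeline)
        that belongs to this consistent set."""
        set_dir = os.path.join(base, "consistent", date)
        os.makedirs(set_dir, exist_ok=True)
        for name, target in members.items():
            link = os.path.join(set_dir, name)
            if not os.path.lexists(link):
                os.symlink(os.path.relpath(target, start=set_dir), link)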

Make dumps discoverable:
* There should be a machine-readable overview of which dumps exist, in which
versions, for each project.
* This overview should be a JSON document (it may even be static).
* Perhaps we also want a DCAT-AP description of our dumps.
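
For illustration, the overview for a single project could look roughly like
this; the structure and field names are just a first guess, not a spec:

    # Hypothetical shape of the machine-readable overview document;
    # structure, field names and dates are made up for illustration.
    import json

    overview = {
        "project": "wikidatawiki",
        "dumps": {
            "wikidata-json": {
                "description": "All entities as JSON",
                "versions": ["20150413", "20150420"],
                "latest": "20150420",
            },
            "wikidata-rdf": {
                "description": "All entities as RDF",
                "versions": ["20150420"],
                "latest": "20150420",
            },
        },
    }
    print(json.dumps(overview, indent=2))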

Promote stable URLs:
* The latest dump of any type should be available under a stable, predictable 
URL.
* TBD: "latest" URL could point to a symlink, get rewritten to the actual file,
or trigger an HTTP redirect.
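
Whichever mechanism we pick, the point is that a client never needs to know
the concrete date. A minimal consumer-side sketch (the URL scheme below is
purely hypothetical):

    # Purely hypothetical URL scheme; the point is only that a stable
    # "latest" alias resolves to the current dated file, whether via
    # symlink, URL rewrite or HTTP redirect.
    import requests

    url = "https://dumps.wikimedia.org/wikidatawiki/wikidata-json/latest.json.gz"
    resp = requests.head(url, allow_redirects=True)
    print(resp.url)  # the concrete file the "latest" alias resolved to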

Thoughts? Comments? Additions?


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

