There's an ongoing discussion in ops about improving the dump process, see

I would like to join in and add our requirements and thoughts to the list, and
would like some input on that. So far I have:

Make it easier to register a new type of dump via a config change.
A dump should define:
* one or more scripts to run
* output file(s)
* the dump schedule
* a short name
* brief description (wikitext or HTML? translatable?)
* required input files (maybe)
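
To make this concrete, here is a rough Python sketch of what such a definition could look like once loaded from config. The field names, the `load_definitions` helper, and the example entry are all made up for illustration; none of this reflects an existing config format.

```python
from dataclasses import dataclass

# Hypothetical set of keys every dump entry must provide.
REQUIRED_KEYS = {"description", "scripts", "outputs", "schedule"}


@dataclass(frozen=True)
class DumpDefinition:
    name: str                      # short name, also the config key
    description: str               # brief human-readable description
    scripts: tuple[str, ...]       # script(s) to run
    outputs: tuple[str, ...]       # output file(s)
    schedule: str                  # e.g. "weekly"; the format is an assumption
    inputs: tuple[str, ...] = ()   # required input files (optional)


def load_definitions(config: dict) -> dict[str, DumpDefinition]:
    """Turn a plain config mapping into validated dump definitions."""
    definitions = {}
    for name, entry in config.items():
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"dump {name!r} is missing: {sorted(missing)}")
        definitions[name] = DumpDefinition(
            name=name,
            description=entry["description"],
            scripts=tuple(entry["scripts"]),
            outputs=tuple(entry["outputs"]),
            schedule=entry["schedule"],
            inputs=tuple(entry.get("inputs", ())),
        )
    return definitions


# Purely illustrative entry; script and file names are invented.
EXAMPLE_CONFIG = {
    "json-entities": {
        "description": "All entities as JSON",
        "scripts": ["dump-json.sh"],
        "outputs": ["all.json.gz"],
        "schedule": "weekly",
    },
}

definitions = load_definitions(EXAMPLE_CONFIG)
```

With something like this, registering a new dump would only mean adding one more entry to the config mapping.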

Make clear timelines of consistent dumps.
* drop the misleading "one dir with one timestamp for all dumps" approach
* have one timeline per dump instead
* for dumps that are guaranteed to be consistent (one generated from the other),
generate a timeline of directories with symlinks to the actual files.
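
The symlink idea could be sketched like this in Python. The directory layout (`<root>/<timeline>/<timestamp>/`) and the function name `publish_consistent_set` are assumptions for the sake of the example, not a proposal for the final layout.

```python
import os
from pathlib import Path


def publish_consistent_set(root: Path, timeline: str, timestamp: str,
                           files: list[Path]) -> Path:
    """Create <root>/<timeline>/<timestamp>/ holding symlinks to `files`.

    The caller is responsible for passing only files that really are
    consistent with each other (e.g. one generated from the other).
    """
    target_dir = root / timeline / timestamp
    # Refuse to overwrite an entry that was already published.
    target_dir.mkdir(parents=True, exist_ok=False)
    for actual in files:
        if not actual.is_file():
            raise FileNotFoundError(actual)
        link = target_dir / actual.name
        # Relative targets keep the links valid if the tree is moved or mirrored.
        link.symlink_to(os.path.relpath(actual, start=target_dir))
    return target_dir
```

Each dump keeps its own timeline of real files; the combined timeline contains only links, so no data is duplicated.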

Make dumps discoverable:
* There should be a machine readable overview of which dumps exist in which
versions for each project.
* This overview should be a JSON document (may even be static)
* Perhaps we also want a DCAT-AP description of our dumps
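
As a sketch of how a static overview could be produced, the following Python walks a directory tree and writes a JSON document. It assumes a layout of `<root>/<project>/<dump>/<version>/<files>`, which is only a guess; the shape of the JSON is equally hypothetical.

```python
import json
from pathlib import Path


def _subdirs(path: Path) -> list[Path]:
    """Return the immediate subdirectories of `path`, sorted by name."""
    return sorted(p for p in path.iterdir() if p.is_dir())


def build_overview(root: Path) -> dict:
    """Map project -> dump -> version -> sorted list of file names."""
    overview = {}
    for project_dir in _subdirs(root):
        dumps = {}
        for dump_dir in _subdirs(project_dir):
            versions = {}
            for version_dir in _subdirs(dump_dir):
                versions[version_dir.name] = sorted(
                    f.name for f in version_dir.iterdir() if f.is_file()
                )
            dumps[dump_dir.name] = versions
        overview[project_dir.name] = dumps
    return overview


def write_overview(root: Path, target: Path) -> None:
    """Write the overview as a static JSON file, e.g. after each dump run."""
    target.write_text(json.dumps(build_overview(root), indent=2, sort_keys=True))
```

Regenerating the file at the end of each dump run would keep it static, so it could be served like any other file.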

Promote stable URLs:
* The latest dump of any type should be available under a stable, predictable URL.
* TBD: "latest" URL could point to a symlink, get rewritten to the actual file,
or trigger an HTTP redirect.
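
Whichever mechanism is picked, the mapping from a "latest" URL to the actual file is the same. A minimal Python sketch, assuming URL paths of the form `/<dump>/latest/<file>` and version names that sort chronologically (such as YYYYMMDD); both are assumptions, as is the function name:

```python
from typing import Optional


def resolve_latest(path: str, versions: dict[str, list[str]]) -> Optional[str]:
    """Map /<dump>/latest/<file> to /<dump>/<newest version>/<file>.

    `versions` maps each dump name to its available version names.
    Returns None if the path is not a "latest" URL or the dump is unknown.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[1] != "latest":
        return None
    dump, _, filename = parts
    available = versions.get(dump)
    if not available:
        return None
    # Relies on version names sorting chronologically as plain strings.
    newest = max(available)
    return f"/{dump}/{newest}/{filename}"
```

The result could be used as a symlink target, as an internal rewrite target, or as the Location of an HTTP redirect.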

Thoughts? Comments? Additions?

Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

Wikidata-tech mailing list
