Let me say a little about the bundle format:

* It is intended to be a complete copy of all wiki resources required to
make an offline dump, in any format.  That means that all the articles are
spidered and template-expanded, and all related images and other media are
fetched and stored in a zip archive.  The archive will also contain all the
license and authorship information needed to make the attributions, etc.,
for a license-compliant rendering.  This should give developers of
rendering backends a substantial head start.

* The current bundle format is backwards compatible with the PediaPress
bundles.  We have made some additions, primarily to better disambiguate
table keys, filenames, etc., so that collections which span multiple wikis
work.  We also add the Parsoid parser output.

* The backwards-compatibility features are somewhat experimental.  As
Matthew noted, the plan is for PediaPress to eventually begin hosting their
bundler on their own servers.  We hope that they will be able to share our
bundles, but that decision is up to them.  We may deprecate some of the
backwards-compatibility content of the bundles (for example, removing the
PHP parser output) if no one ends up using it.  (Nonetheless, having
PediaPress's working bundle format was very helpful to me in writing the new
bundler, and I want to thank them!)

* I've made a conscious effort to support *very large* bundles in this
format.  That is, I try not to hold the complete data for a bundle in
memory, and we use sqlite databases wherever possible to support
article-at-a-time access during rendering.  The MW-hosted servers will
probably have reasonably small resource limits, but it is my intention that
if you want to create an offline dump of an entire wiki (or a large subset
thereof), you should be able to use the existing renderers and bundler
to do so.  I'd encourage people interested in making large slices to get in
touch and start playing with the code, so we can identify any
bundle-format-related bottlenecks and eliminate them before the bundle
format is too firmly established.

* The bundler (and LaTeX renderer) are independent npm modules, loosely
coupled to the Collection extension.  Again, this should encourage reuse of
the bundler and renderer in other projects.  Patches welcome!

http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCollection%2FOfflineContentGenerator%2Fbundler.git

http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCollection%2FOfflineContentGenerator%2Flatex_renderer.git

The npm module names are still in flux.  They are currently mw-bundler and
mw-latexer; maybe mw-ocg-bundler etc. would be better.
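
To make the "article-at-a-time" idea from the list above concrete, here is
a minimal sketch in Python.  It is not the bundler's actual code or schema:
the table name (kv), the "wiki|title" key layout, and all the helper names
are assumptions made up for illustration.  It only shows the general
technique of keeping articles in a sqlite key/value table, prefixing keys
with a wiki identifier so that the same title on two wikis does not
collide, and streaming rows during rendering instead of loading the whole
bundle into memory.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a key/value table. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, val TEXT)"
    )
    return db


def make_key(wiki, title):
    # Hypothetical key layout: prefix the title with a wiki identifier so
    # that collections spanning multiple wikis get distinct keys.
    return f"{wiki}|{title}"


def put_article(db, wiki, title, payload):
    """Store one article's data as a JSON string under its prefixed key."""
    db.execute(
        "INSERT OR REPLACE INTO kv (key, val) VALUES (?, ?)",
        (make_key(wiki, title), json.dumps(payload)),
    )


def get_article(db, wiki, title):
    """Fetch a single article without touching any other row."""
    row = db.execute(
        "SELECT val FROM kv WHERE key = ?", (make_key(wiki, title),)
    ).fetchone()
    return json.loads(row[0]) if row else None


def iter_articles(db):
    """Yield (wiki, title, payload) one row at a time from the cursor."""
    for key, val in db.execute("SELECT key, val FROM kv ORDER BY key"):
        wiki, _, title = key.partition("|")
        yield wiki, title, json.loads(val)


# Two wikis, same title: the prefixed keys keep them apart.
db = open_store()
put_article(db, "enwiki", "Mercury", {"html": "<p>planet</p>"})
put_article(db, "dewiki", "Mercury", {"html": "<p>Planet</p>"})
db.commit()
```

A renderer built along these lines would loop over iter_articles(db) and
emit each article as it goes, so memory use depends on the largest single
article rather than on the size of the whole collection.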
  --scott
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l