> One possibility is considering storing rendered HTML for old revisions. It
> lets wikitext (and hence parser) evolve without breaking old revisions. Plus
> rendered HTML will use the template revision at the time it was rendered vs.
> the latest revision (this is the problem Memento tries to solve).

Long-term HTML archival is something we have been gradually working
towards with RESTBase.

Since HTML is about 10x larger than wikitext, a major concern is storage
cost. Old estimates <https://phabricator.wikimedia.org/T97710> put the
total storage needed to store one HTML copy of each revision at roughly
120T. To reduce this cost, we have since implemented several improvements
<https://phabricator.wikimedia.org/T93751>:


   - Brotli compression <https://en.wikipedia.org/wiki/Brotli>, once
   deployed, is expected to reduce total storage needs to roughly 1/4-1/5
   of the gzip-compressed size
   <https://phabricator.wikimedia.org/T122028#2004953>; see the gzip vs.
   Brotli sketch after this list.
   - The ability to split the latest revisions from old revisions lets us
   use cheaper and slower storage for the latter.
   - Retention policies let us specify how many renders per revision we
   want to archive. We currently archive only one (the latest) render per
   revision, but have the option to store one render per $time_unit (a
   small sketch of such a policy follows this list). This is especially
   important for pages like [[Main Page]], which are rarely edited but
   constantly change their content in meaningful ways via templates. It is
   currently not possible to reliably cite such pages without resorting to
   external services like archive.org.
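
To give a rough idea of the first point, here is a small gzip vs. Brotli
comparison in Python (a sketch, not our production pipeline; the file name
and compression settings are placeholders):

    # Compare gzip and Brotli output sizes for one archived HTML render.
    # Needs the third-party 'brotli' package (pip install brotli).
    import gzip
    import brotli

    with open("revision.html", "rb") as f:  # placeholder path to a stored render
        html = f.read()

    gz = gzip.compress(html, compresslevel=9)
    br = brotli.compress(html, quality=11)  # highest quality; fine for cold storage

    print("original: {:>12,} bytes".format(len(html)))
    print("gzip -9:  {:>12,} bytes".format(len(gz)))
    print("brotli:   {:>12,} bytes ({:.2f} of gzip size)".format(
        len(br), len(br) / float(len(gz))))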
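
To illustrate the retention point, here is a minimal sketch of a "one
render per time unit" policy. This is not the actual RESTBase
configuration or API, just the idea expressed in Python:

    from datetime import datetime, timedelta

    def renders_to_keep(renders, bucket=timedelta(days=1)):
        """renders: iterable of (timestamp, render_id) pairs for one revision.
        Keep the newest render, plus at most one render per elapsed bucket."""
        keep = []
        last_kept = None
        for ts, render_id in sorted(renders, reverse=True):  # newest first
            if last_kept is None or last_kept - ts >= bucket:
                keep.append(render_id)
                last_kept = ts
        return keep

    # Example: hourly renders of [[Main Page]], keeping one per day.
    now = datetime(2016, 8, 1)
    renders = [(now - timedelta(hours=h), "render-%d" % h) for h in range(72)]
    print(renders_to_keep(renders))  # ['render-0', 'render-24', 'render-48']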


Another important requirement for making HTML a useful long-term archival
medium is establishing a clear standard for the HTML structures used. The
versioned Parsoid HTML spec
<https://www.mediawiki.org/wiki/Specs/HTML/1.2.1>, along with format
migration logic for old content, is designed to make the stored HTML as
future-proof as possible.
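
As a rough sketch of what this buys a future consumer -- assuming the
stored metadata carries the spec URL in the form linked above, and with
the actual migration step left as a placeholder -- version dispatch could
look like this:

    import re

    CURRENT_MAJOR = "1"  # major version of the spec linked above

    def spec_version(content_type):
        """Extract e.g. '1.2.1' from a profile URL such as
        .../Specs/HTML/1.2.1 carried with the stored render."""
        m = re.search(r'Specs/HTML/(\d+(?:\.\d+)+)', content_type or "")
        return m.group(1) if m else None

    def load_render(html, content_type):
        """Return archived HTML, upgraded to the current major spec version."""
        version = spec_version(content_type)
        if version is None:
            raise ValueError("stored render does not advertise a spec version")
        if version.split(".")[0] != CURRENT_MAJOR:
            # Placeholder: real migration code would rewrite the structures
            # that changed between major spec versions before returning.
            raise NotImplementedError("no migration from spec " + version)
        return html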

While we currently only have space for a few months' worth of HTML
revisions, we do expect the changes above to make it possible to push this
to years in the foreseeable future without unreasonable hardware needs.
This means that we can start building up an archive of our content in a
format that is not tied to the software.

Faithfully re-rendering old revisions after the fact is a harder problem.
We will likely have to make some trade-offs between fidelity & effort.

Gabriel


On Mon, Aug 1, 2016 at 2:01 PM, David Gerard <dger...@gmail.com> wrote:

> On 1 August 2016 at 17:37, Marc-Andre <m...@uberbox.org> wrote:
>
> > We need to find a long-term view to a solution.  I don't mean just keeping
> > old versions of the software around - that would be of limited help. It'd
> > be an interesting nightmare to try to run early versions of phase3 nowadays,
> > and probably require managing to make a very very old distro work and
> > finding the right versions of an ancient apache and PHP.  Even *building*
> > those might end up being a challenge... when is the last time you saw a
> > working egcs install? I shudder to think how nigh-impossible the task might
> > be 100 years from now.
>
>
> oh god yes. I'm having this now, trying to revive an old Slash
> installation. I'm not sure I could even reconstruct a box to run it
> without compiling half of CPAN circa 2002 from source.
>
> Suggestion: set up a copy of WMF's setup on a VM (or two or three),
> save that VM and bundle it off to the Internet Archive as a dated
> archive resource. Do this regularly.
>
>
> - d.
>



-- 
Gabriel Wicke
Principal Engineer, Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
