On Tue, Nov 3, 2020 at 1:59 AM Daniel Kinzler <dkinz...@wikimedia.org>
wrote:

> TemplateData already uses JSON serialization, but then compresses the JSON
> output, to make the data fit into the page_props table. This results in
> binary data in ParserOutput, which we can't directly put into JSON.


I'm not sure I understand the problem. Binary data can be trivially
represented in JSON by treating it as a string. Is it an issue of storage
size? JSON escaping of the control characters is (assuming binary data with
a somewhat random distribution of bytes) a ~50% size increase, and UTF-8
encoding the top half of the byte range adds another 50%, so it will
approximately double the length - certainly worse than the ~33% increase for
base64, but not tragic. (And if the size increase matters that much, you
probably shouldn't be using base64 either.)

> * Don't write the data to page_props, treat it as extension data in
> ParserOutput. Compression would become unnecessary. However, batch loading
> of the data becomes much slower, since each ParserOutput needs to be loaded
> from ParserCache. Would it be too slow?
>

It would also mean that fetching template data or some other page property
might require a parse, since parser cache entries expire.
It would also mean the properties could not be searched, which I think is a
dealbreaker.

> * Apply compression for page_props, but not for the data in ParserOutput.
> We would have to introduce some kind of serialization mechanism into
> PageProps and LinksUpdate. Do we want to encourage this use of page_props?
>

IMO we don't want to. page_props is for page *properties*, not arbitrary
structured data. It's also somewhat problematic in that it is per-page data
representing the result of a parse, so it doesn't necessarily match the
current revision, nor what a user with non-canonical parser options sees.
New features should probably use MCR for structured data.

> * Introduce a dedicated database table for templatedata. Cleaner, but
> schema changes and data migration take a long time.
>

That seems like a decent solution to me, and probably the one I would pick
(unless there are more extensions in a similar situation). This is
secondary data, so it doesn't really need to be migrated; just make
TemplateData write to the new table and fall back to the old one when
reading. Creating new tables should also not be time-consuming.
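
To illustrate the no-migration fallback, here is a rough Python/SQLite
sketch (the table and column names are made up for illustration, not a real
schema, and decoding of the legacy compressed value is left to the existing
code path):

    import json
    import sqlite3

    def load_template_data(db: sqlite3.Connection, page_id: int):
        """Prefer the new dedicated table; fall back to the legacy
        page_props row while old rows still exist."""
        row = db.execute(
            "SELECT td_data FROM templatedata WHERE td_page = ?", (page_id,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])

        # Legacy path: the compressed blob TemplateData writes today; the
        # existing decompression code would handle decoding it.
        legacy = db.execute(
            "SELECT pp_value FROM page_props"
            " WHERE pp_page = ? AND pp_propname = 'templatedata'",
            (page_id,),
        ).fetchone()
        return legacy[0] if legacy else None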

> * Put templatedata into the BlobStore, and just the address into
> page_props. Makes loading slower, maybe even slower than the solution that
> relies on ParserCache.
>

Doesn't BlobStore support batch loading, unlike ParserCache?

> * Convert TemplateData to MCR. This is the cleanest solution, but would
> require us to create an editing interface for templatedata, and migrate out
> existing data from wikitext. This is a long term perspective.
>

MCR has fairly different semantics from parser metadata. There are many
ways TemplateData data can be generated for a page without having a
<templatedata> tag in the wikitext (e.g. a doc subpage, or a template which
generates both documentation HTML and hidden TemplateData). Switching to
MCR should be thought of as a workflow adjustment for contributors, not
just a data migration.