On 06/29/2015 09:19 AM, Brad Jorsch (Anomie) wrote:
> On Thu, Jun 25, 2015 at 6:22 PM, Subramanya Sastry
> <ssas...@wikimedia.org> wrote:
>> * Pare down rendering differences between the two systems so that
>>   we can start thinking about using Parsoid HTML instead of
>>   MWParser HTML for read views.
>>   ( https://phabricator.wikimedia.org/T55784 )
>
> Any hope of adding the Parsoid metadata to the MWParser HTML so
> various fancy things can be done in core MediaWiki for smaller
> installations instead of having to run a separate service? Or does
> that fall under "Make Parsoid redundant in its current complex avatar"?
Short answer: the latter.
Long answer: read on.
Our immediate focus in the coming months will be to bring the PHP parser's
and Parsoid's output closer together. Some of that work will be tweaking
Parsoid output / CSS where required, but we will also bring the PHP
parser's output closer to Parsoid's. https://gerrit.wikimedia.org/r/#/c/196532/
is one step along those lines, for example; Scott has said he will review
it closely with this goal in mind. Another step is to get rid of Tidy and
use an HTML5-compliant tree builder similar to what Parsoid uses.
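To make the difference concrete: an HTML5-compliant tree builder doesn't just clean up markup string-wise the way Tidy does, it restructures the tree per the spec. One mandated rule is "foster parenting": content that appears directly inside a <table> but outside any cell gets moved so it lands *before* the table. Here is a deliberately tiny, self-contained Python sketch of that one rule (purely illustrative -- neither Tidy nor any real tree builder works on flat token lists like this):

```python
# Toy illustration of the HTML5 "foster parenting" rule: character
# tokens that appear directly inside <table> (outside any cell) are
# moved so they render *before* the table. A string-level cleaner
# like Tidy does not restructure the tree this way.

def parse_table_fragment(tokens):
    """tokens: a flat token list for a table fragment.
    Returns (fostered_text, cell_text): text moved before the table,
    and text kept inside cells."""
    fostered, cells = [], []
    in_cell = False
    for tok in tokens:
        if tok == '<td>':
            in_cell = True
        elif tok == '</td>':
            in_cell = False
        elif tok.startswith('<'):
            continue  # other table-structure tokens: tr, table, etc.
        else:
            # Character token: keep it if we are inside a cell,
            # otherwise foster it out in front of the table.
            (cells if in_cell else fostered).append(tok)
    return fostered, cells

fostered, cells = parse_table_fragment(
    ['<table>', 'oops', '<tr>', '<td>', 'x', '</td>', '</tr>', '</table>'])
print(fostered)  # ['oops'] -- ends up *before* the table in the DOM
print(cells)     # ['x']
```

The point of the sketch: 'oops' was written between <table> and <tr> in the source, but the spec-compliant tree puts it before the table entirely, which matters a great deal once metadata has to survive that move.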
Beyond these initial steps, bringing the two together (both in terms of
output and functionality) will require bridging the two computation models:
string-based vs. DOM-based. For example, we cannot really add
Parsoid-style metadata for templates to the PHP parser's output without
being able to analyze the DOM -- and that requires access to the DOM
after Tidy (or, ideally, its replacement) has had a go at it. It also
requires implementing all the dirty tricks Parsoid uses to identify
template boundaries in the presence of unclosed tags, misnested tags,
content fostered out of tables, and the DOM restructuring the HTML5 tree
builder does to comply with HTML5 semantics.
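As an illustration of why template boundaries need a DOM analysis, suppose a template's output were bracketed by start/end marker comments in the token stream (the marker format here is invented for this sketch; it is not Parsoid's actual mechanism). Grouping the DOM siblings between the markers is easy in the flat case -- but once the tree builder fosters stray table content, part of the template's output has moved outside the markers, and a naive sibling scan silently loses it:

```python
# Toy DOM: an element is (tag, children); text and comments are strings.
# Template output is bracketed by hypothetical '<!--tpl-start-->' /
# '<!--tpl-end-->' markers (invented here for illustration).

def mark_template_siblings(children):
    """Return the sibling nodes between the start and end markers --
    the nodes a Parsoid-style template group would annotate."""
    group, inside = [], False
    for node in children:
        if node == '<!--tpl-start-->':
            inside = True
        elif node == '<!--tpl-end-->':
            inside = False
        elif inside:
            group.append(node)
    return group

# Flat case: the markers bracket the output and grouping is trivial.
flat = ['<!--tpl-start-->', ('b', ['hi']), '<!--tpl-end-->']
print(mark_template_siblings(flat))  # [('b', ['hi'])]

# After fostering, stray template text has been moved *outside* the
# markers (it now precedes the table), so the sibling scan misses it:
fostered = ['oops', '<!--tpl-start-->', ('table', [('td', ['x'])]),
            '<!--tpl-end-->']
print(mark_template_siblings(fostered))  # 'oops' is lost from the group
```

Recovering 'oops' as part of the template extent requires exactly the kind of post-tree-builder DOM gymnastics described above, which the string-based PHP parser has no natural place for.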
Besides that, if you also want to serialize this back to wikitext
without introducing dirty diffs (there is really no reason to do all
this extra work if you cannot also serialize it back to wikitext), you
need to be able to either (a) maintain a lot of extra state in the
DOM beyond what Parsoid maintains, or (b) do all the additional work
Parsoid does to maintain an extremely precise mapping between
wikitext strings and DOM trees. Once again, the only reason (b) is
complicated is the same set of issues: unclosed tags, misnested tags,
fostered content, and DOM restructuring due to HTML5 semantics.
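The core idea behind (b) can be sketched in a few lines: each DOM node remembers the source offsets of the wikitext it came from, so unmodified nodes can be emitted verbatim (hence no dirty diffs) and only edited nodes need to be re-serialized. This is a minimal illustration of the concept only; Parsoid's real mechanism (its DSR offsets plus a separate selective serializer) is far more involved, precisely because of the tree restructuring described above:

```python
# Minimal sketch of source-range-based selective serialization:
# every node records the [start, end) offsets of the wikitext it was
# parsed from. Unmodified nodes emit their original source verbatim;
# only edited nodes are re-serialized.

WIKITEXT = "'''bold''' and [[Link]]"

class Node:
    def __init__(self, start, end, modified=False, new_text=None):
        self.start, self.end = start, end        # offsets into WIKITEXT
        self.modified, self.new_text = modified, new_text

    def serialize(self):
        if self.modified:
            return self.new_text                 # re-serialize edited node
        return WIKITEXT[self.start:self.end]     # verbatim: no dirty diff

# Three top-level nodes: bold span, plain text, link.
nodes = [Node(0, 10), Node(10, 15), Node(15, 23)]
print(''.join(n.serialize() for n in nodes))     # round-trips unchanged

# Edit only the link; the rest of the page is untouched byte-for-byte.
nodes[2] = Node(15, 23, modified=True, new_text="[[Other]]")
print(''.join(n.serialize() for n in nodes))
```

The hard part, of course, is computing those offsets reliably in the first place when the tree builder has fostered, reparented, or restructured nodes away from the source text they came from.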
There is a fair amount of complexity hidden in those two steps, and
it really does not make sense to reimplement all of it in the PHP
parser. If you do, you've effectively reimplemented Parsoid in PHP at
that point -- and the PHP parser in its current form is unlikely to stay
as is anyway.
So, the only real way out here is to move the wikitext computational
model closer to a DOM model. This is not a done deal, really, but we have
talked about several ideas over the last couple of years to move this
forward in increments. I don't want to go into a lot of detail in this
email since it is already getting lengthy, but I am happy to talk more
about it if there is interest.
To summarize, here are the steps as we see them:
* Bring the PHP parser's and Parsoid's output as close as we can (replace
Tidy, fix PHP parser output wherever possible to match Parsoid's).
* Incrementally move the wikitext computational model to be DOM-based,
using Parsoid as the bridge that preserves compatibility. This is easier
once Tidy is out of the equation.
* Smooth out the harder edge cases, which simplifies the problem and
eliminates the complexity.
* At that point, Parsoid's current complexity will be unnecessary
(specifics depend on the previous steps), i.e. you could have this
functionality back in PHP if that is so desired. But by then there will
hopefully also be better clarity about MediaWiki packaging, which will
influence this too. Or some small wikis might decide to be HTML-only
wikis.
Subbu.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l