Re: [Wikitech-l] [Engineering] Parsoid announcement: Main roundtrip quality target achieved

2015-06-29 Thread Brad Jorsch (Anomie)
On Thu, Jun 25, 2015 at 6:22 PM, Subramanya Sastry ssas...@wikimedia.org
wrote:

> * Pare down rendering differences between the two systems so that
>   we can start thinking about using Parsoid HTML instead of MWParser HTML
>   for read views. ( https://phabricator.wikimedia.org/T55784 )


Any hope of adding the Parsoid metadata to the MWParser HTML so various
fancy things can be done in core MediaWiki for smaller installations
instead of having to run a separate service? Or does that fall under "Make
Parsoid redundant in its current complex avatar"?

-- 
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Engineering] Parsoid announcement: Main roundtrip quality target achieved

2015-06-29 Thread Subramanya Sastry

On 06/29/2015 09:19 AM, Brad Jorsch (Anomie) wrote:
> On Thu, Jun 25, 2015 at 6:22 PM, Subramanya Sastry
> ssas...@wikimedia.org wrote:
>
>> * Pare down rendering differences between the two systems so that
>>   we can start thinking about using Parsoid HTML instead of
>>   MWParser HTML for read views.
>>   ( https://phabricator.wikimedia.org/T55784 )
>
> Any hope of adding the Parsoid metadata to the MWParser HTML so
> various fancy things can be done in core MediaWiki for smaller
> installations instead of having to run a separate service? Or does
> that fall under "Make Parsoid redundant in its current complex avatar"?


Short answer: the latter.
Long answer: read on.

Our immediate focus in the coming months will be to bring PHP parser
and Parsoid output closer together. Some of that work will be tweaking
Parsoid output / CSS where required, and some will be bringing PHP parser
output closer to Parsoid output. https://gerrit.wikimedia.org/r/#/c/196532/
is one step along those lines, for example. Scott has said he will review
that closely with this goal in mind. Another step is to get rid of Tidy
and use an HTML5-compliant tree builder similar to what Parsoid uses.


Beyond these initial steps, bringing the two together (both in terms of
output and functionality) will require bridging the two computation
models: string-based vs. DOM-based. For example, we cannot really add
Parsoid-style metadata for templates to the PHP parser output without
being able to analyze the DOM -- that requires us to access the DOM
after Tidy (or, ideally, the Tidy replacement) has had a go at it. It
also requires reimplementing all the dirty tricks we use in Parsoid to
identify template boundaries in the presence of unclosed tags, misnested
tags, fostered content from tables, and the DOM restructuring that the
HTML tree builder does to comply with HTML5 semantics.
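
As a rough illustration of why string-level template boundaries do not map
cleanly onto DOM subtrees, here is a minimal Python sketch. It is not how
Parsoid does it; it only checks, at the token level, whether an HTML
fragment (say, the expanded output of one template) leaves tags open or
closes tags it never opened. The names BalanceChecker and check_fragment
are made up for this example, and the list of void tags is deliberately
incomplete.

```python
from html.parser import HTMLParser

# Assumption for this sketch: a small, incomplete set of void elements.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BalanceChecker(HTMLParser):
    """Record tags a fragment leaves open and end tags with no matching opener."""

    def __init__(self):
        super().__init__()
        self.open_stack = []    # start tags not yet closed, outermost first
        self.stray_closes = []  # end tags that did not match the innermost open tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_stack and self.open_stack[-1] == tag:
            self.open_stack.pop()
        else:
            self.stray_closes.append(tag)


def check_fragment(html):
    """Return (tags left open, stray end tags) for one HTML fragment."""
    checker = BalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.open_stack, checker.stray_closes
```

A "table start" template whose output is "<table><tr><td>cell" leaves three
tags open, and the matching "table end" template closes three tags it never
opened. Once a tree builder has processed the whole page, neither template
corresponds to a single DOM subtree, which is the hard part that a
token-level check like this does not solve.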


Besides that, if you want to also serialize this back to wikitext 
without introducing dirty diffs (there is really no reason to do all 
this extra work if you cannot also serialize it back to wikitext), you 
also need to be able to either (a) maintain a lot of extra state in the 
DOM beyond what Parsoid maintains, or (b) do all the additional work 
that Parsoid does to maintain an extremely precise mapping between 
wikitext strings and DOM trees. Once again, the only reason (b) is
complicated is because of unclosed tags, misnested tags, fostered
content, and DOM restructuring because of HTML5 semantics.
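
To make the dirty-diff point concrete, here is a toy Python sketch of the
idea behind option (b). It is not Parsoid code: the one-node-per-line
"DOM", the Node class, and the heading normalization in canonical() are all
assumptions made up for the example. What it shows is that if every node
remembers which range of the original wikitext it came from, unmodified
nodes can be emitted from the original source verbatim, and only edited
nodes go through the serializer.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    start: int              # start offset of this node in the original wikitext
    end: int                # end offset (exclusive)
    content: str            # current content held by the toy DOM node
    modified: bool = False  # set when an editor changes the node


def parse(wikitext):
    """Toy parser: one node per line, each recording its source range."""
    nodes, pos = [], 0
    for line in wikitext.splitlines(keepends=True):
        nodes.append(Node(pos, pos + len(line), line))
        pos += len(line)
    return nodes


def canonical(content):
    """Toy serializer: always emits headings in the form '== Title =='."""
    m = re.fullmatch(r"(=+)\s*(.*?)\s*(=+)(\n?)", content)
    if m and m.group(1) == m.group(3):
        return f"{m.group(1)} {m.group(2)} {m.group(1)}{m.group(4)}"
    return content


def serialize_full(nodes):
    """Serialize every node; normalizes text the user never touched."""
    return "".join(canonical(n.content) for n in nodes)


def serialize_selective(nodes, original):
    """Reuse original source for unmodified nodes; serialize only edits."""
    return "".join(
        canonical(n.content) if n.modified else original[n.start:n.end]
        for n in nodes
    )
```

With original = "==Intro==\nSome text.\n" and an edit to the second line
only, serialize_full() rewrites the untouched heading as "== Intro ==" (a
dirty diff), while serialize_selective() leaves it exactly as it was.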


There is a fair amount of complexity hidden there in those 2 steps, and 
it really does not make sense to reimplement all of that in the PHP 
parser. If you do, at that point, you've effectively reimplemented 
Parsoid in PHP -- the PHP parser in its current form is unlikely to stay 
as is.


So, the only real way out here is to move the wikitext computational 
model closer to a DOM model. This is not a done deal really, but we have 
talked about several ideas over the last couple years to move this 
forward in increments. I don't want to go into a lot of detail in this 
email since this is already getting lengthy, but I am happy to talk more 
about it if there is interest.


To summarize, here are the steps as we see them:

* Bring PHP parser and Parsoid output as close as we can (replace Tidy,
  fix PHP parser output wherever possible to be closer to Parsoid output).
* Incrementally move the wikitext computational model to be DOM-based,
  using Parsoid as the bridge that preserves compatibility. This is easier
  if we have removed Tidy from the equation.
* Smooth out the harder edge cases, which simplifies the problem and
  eliminates the complexity.
* At this point, Parsoid's current complexity will be unnecessary
  (specifics dependent on previous steps) => you could have this
  functionality back in PHP if it is so desired. But, by then, hopefully,
  there will also be better clarity about MediaWiki packaging, which will
  also influence this. Or, some small wikis might decide to be HTML-only
  wikis.


Subbu.

Re: [Wikitech-l] [Engineering] Parsoid announcement: Main roundtrip quality target achieved

2015-06-25 Thread Greg Grossmeier
<quote name="Subramanya Sastry" date="2015-06-25" time="17:22:53 -0500">
> ---
> TL;DR
>
> 1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing
>    without introducing semantic diffs[2].
> 2. With trivial simulated edits, the HTML -> wikitext serializer used
>    in production (selective serialization) introduces ZERO dirty diffs
>    in 99.986% of those edits[3]. 10 of those 23 edits with dirty diffs
>    are minor newline diffs.
> ---

Huge congrats, Subbu and team!

-- 
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |


Re: [Wikitech-l] [Engineering] Parsoid announcement: Main roundtrip quality target achieved

2015-06-25 Thread James Forrester
On 25 June 2015 at 15:22, Subramanya Sastry ssas...@wikimedia.org wrote:

> TL;DR
>
> 1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing
>    without introducing semantic diffs[2].
> 2. With trivial simulated edits, the HTML -> wikitext serializer used
>    in production (selective serialization) introduces ZERO dirty diffs
>    in 99.986% of those edits[3]. 10 of those 23 edits with dirty diffs
>    are minor newline diffs.


Subbu,

You and your team have done, and keep on doing, amazing stuff. Thank you
all so very much. Congratulations doesn't come close. :-)

Yours,
-- 
James D. Forrester
Lead Product Manager, Editing
Wikimedia Foundation, Inc.

jforres...@wikimedia.org | @jdforrester

Re: [Wikitech-l] [Engineering] Parsoid announcement: Main roundtrip quality target achieved

2015-06-25 Thread Ori Livneh
On Thu, Jun 25, 2015 at 3:22 PM, Subramanya Sastry ssas...@wikimedia.org
wrote:

> Hello everyone,
>
> On behalf of the parsing team, here is an update about Parsoid, the
> bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow,
> and Content Translation.
>
> Subbu.
>
> ---
> TL;DR
>
> 1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing
>    without introducing semantic diffs[2].


Congratulations, parsing team. This is very cool.


...and, pssst, wink wink, nudge nudge, etc:
http://cacm.acm.org/about-communications/author-center/author-guidelines
http://queue.acm.org/author_guidelines.cfm

:)