Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved

2015-06-29 Thread Brad Jorsch (Anomie)
On Fri, Jun 26, 2015 at 11:52 AM, Subramanya Sastry ssas...@wikimedia.org
wrote:

 On 06/25/2015 06:29 PM, David Gerard wrote:

 On 25 June 2015 at 23:22, Subramanya Sastry ssas...@wikimedia.org
 wrote:

  On behalf of the parsing team, here is an update about Parsoid, the
 bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow,
 and Content Translation.

 Excellent. How close are we to binning the PHP parser? (I realise
 that's a way off, but grant me my dreams.)


 The PHP parser used in production has 3 components: the preprocessor,
 the core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed
 via the MediaWiki API), so that part of the PHP parser will continue to be
 in operation.

 As noted in my update, we are working towards read views served by Parsoid
 HTML which requires several ducks to be lined up in a row. When that
 happens everywhere, the core PHP parser and Tidy will no longer be used.


Do we have plans for avoiding code rot in the unused PHP parser code, which
would affect smaller third-party sites that aren't using Parsoid?


-- 
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved

2015-06-29 Thread Subramanya Sastry

On 06/29/2015 09:20 AM, Brad Jorsch (Anomie) wrote:

On Fri, Jun 26, 2015 at 11:52 AM, Subramanya Sastry ssas...@wikimedia.org
wrote:


The PHP parser used in production has 3 components: the preprocessor,
the core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed
via the MediaWiki API), so that part of the PHP parser will continue to be
in operation.

As noted in my update, we are working towards read views served by Parsoid
HTML which requires several ducks to be lined up in a row. When that
happens everywhere, the core PHP parser and Tidy will no longer be used.

Do we have plans for avoiding code rot in the unused PHP parser code, which
would affect smaller third-party sites that aren't using Parsoid?


My response to your other email covers quite a bit of this.

As far as I have observed, the PHP parser code has been quite stable for
a while. And small third-party sites are unlikely to have complex
requirements, so they are less likely to hit serious bugs. In any case,
we'll make a good-faith effort to keep the PHP parser maintained, and we'll
fix critical and really high-priority bugs. But, simply by virtue of us
being a small team with multiple responsibilities, we will prioritize
reducing complexity in Parsoid over keeping the PHP parser maintained.
In the long run, I think that is a better path to bringing the two
systems together.


Subbu.


Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved

2015-06-26 Thread Subramanya Sastry

On 06/25/2015 06:29 PM, David Gerard wrote:

On 25 June 2015 at 23:22, Subramanya Sastry ssas...@wikimedia.org wrote:


On behalf of the parsing team, here is an update about Parsoid, the
bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow,
and Content Translation.

Excellent. How close are we to binning the PHP parser? (I realise
that's a way off, but grant me my dreams.)


The PHP parser used in production has 3 components: the preprocessor,
the core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed
via the MediaWiki API), so that part of the PHP parser will continue to
be in operation.


As noted in my update, we are working towards read views served by 
Parsoid HTML which requires several ducks to be lined up in a row. When 
that happens everywhere, the core PHP parser and Tidy will no longer be 
used.


However, I imagine your question is not so much about the PHP parser ...
but more about wikitext and templating. Since I don't want to go off on
a tangent here based on an assumption, maybe you can say more about what
you had in mind when you asked about binning the PHP parser.


Subbu.


Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved

2015-06-26 Thread David Gerard
I didn't have anything in mind, evidently I was just vague on what the
stuff in there is and does :-)

On 26 June 2015 at 16:52, Subramanya Sastry ssas...@wikimedia.org wrote:
 On 06/25/2015 06:29 PM, David Gerard wrote:

 On 25 June 2015 at 23:22, Subramanya Sastry ssas...@wikimedia.org wrote:

 On behalf of the parsing team, here is an update about Parsoid, the
 bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow,
 and Content Translation.

 Excellent. How close are we to binning the PHP parser? (I realise
 that's a way off, but grant me my dreams.)


 The PHP parser used in production has 3 components: the preprocessor, the
 core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed via
 the MediaWiki API), so that part of the PHP parser will continue to be in
 operation.

 As noted in my update, we are working towards read views served by Parsoid
 HTML which requires several ducks to be lined up in a row. When that happens
 everywhere, the core PHP parser and Tidy will no longer be used.

 However, I imagine your question is not so much about the PHP parser ... but
 more about wikitext and templating. Since I don't want to go off on a
 tangent here based on an assumption, maybe you can say more about what you
 had in mind when you asked about binning the PHP parser.

 Subbu.



[Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved

2015-06-25 Thread Subramanya Sastry

Hello everyone,

On behalf of the parsing team, here is an update about Parsoid, the 
bidirectional wikitext <-> HTML parser that supports Visual Editor,
Flow, and Content Translation.


Subbu.

---
TL;DR:

1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing
   without introducing semantic diffs[2].
2. With trivial simulated edits, the HTML -> wikitext serializer used
   in production (selective serialization) introduces ZERO dirty diffs
   in 99.986% of those edits[3]. 10 of those 23 edits with dirty diffs
   are minor newline diffs.
---

A couple of days back (June 23rd), Parsoid achieved 99.95%[2] semantic
accuracy in the wikitext -> HTML -> wikitext roundtripping process on the
set of about 158K pages randomly picked from about 16 wikis back in 2013.
Keeping this test set constant has let us monitor our progress over time.
We were at 99.75% last year around this time.
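The measurement behind those numbers can be sketched roughly as follows (a toy illustration, not Parsoid's actual test harness; the parse/serialize step is elided and only the diff classification is shown):

```javascript
// Toy sketch of roundtrip diff classification (NOT Parsoid's real test
// code): given the original wikitext and the wikitext produced by parsing
// to HTML and serializing back, decide whether the page counts as clean,
// as having only a syntactic diff (e.g. a newline diff), or as having a
// semantic diff.
function classifyRoundtrip(original, roundtripped) {
  if (original === roundtripped) {
    return 'clean';
  }
  // Collapse all whitespace; if the texts now match, only formatting
  // changed, so the diff is syntactic rather than semantic.
  const normalize = (s) => s.replace(/\s+/g, ' ').trim();
  return normalize(original) === normalize(roundtripped)
    ? 'syntactic'
    : 'semantic';
}
```

The 99.95% figure corresponds to pages classified as having no semantic diff under a (much more sophisticated) comparison of this kind.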

What does this mean?

* Despite the practical complexities of wikitext, the mismatch in the
  processing models of wikitext (string-based) and Parsoid (DOM-based),
  and the various wikitext errors that are found on pages, Parsoid is able
  to maintain a reversible mapping between wikitext constructs and their
  equivalent HTML DOM trees that HTML editors and other tools can
  manipulate.

  The majority of differences in the 0.05% arise because of wikitext errors:
  links in links, 'fosterable'[4] content in tables, and some scenarios
  with unmatched quotes in attributes. Parsoid does not support
  round-tripping (RT) of these.

* While this is not a big change from how it has been for about a year now
  in terms of Parsoid's support for editing, this is a notable milestone
  for us in terms of the confidence we have in Parsoid's ability to handle
  the wikitext usage seen in production wikis and our ability to RT them
  accurately without corrupting pages. This should also boost confidence
  of all applications that rely on Parsoid.

* In production, Parsoid uses a selective serialization strategy which
  tries to preserve unedited parts of wikitext as far as possible.

  As part of regular testing, we also simulate a trivial edit by adding
  a new comment to the page and run the edited HTML through this
  selective serializer. All but 23 pages (0.014% of trivial edits) had
  ZERO dirty diffs[3]. Of these 23, 10 of the diffs were minor newline
  diffs.

  In production, the dirty diff rate will be higher than 0.014% because of
  more complex edits and because of bugs in any of the 3 components involved
  in visual editing on Wikipedias (Parsoid, RESTBase[5], and Visual Editor)
  and their interaction. But the base accuracy of Parsoid's roundtripping
  (both in terms of full and selective serialization) is critical to
  ensuring clean visual edits. The above milestones are part of ensuring
  that.
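The idea behind selective serialization can be illustrated with a small sketch (the node shape and serializer here are assumptions for illustration, not Parsoid's real API): each top-level node remembers the source wikitext span it came from, and only nodes marked as edited go through the from-scratch serializer.

```javascript
// Illustrative sketch only (hypothetical data shapes, not Parsoid's API).
// Unedited nodes emit their original source wikitext verbatim, so they
// cannot introduce dirty diffs; only edited nodes are re-serialized from
// their DOM form.
function selectiveSerialize(nodes, serializeNode) {
  return nodes
    .map((node) => (node.edited ? serializeNode(node) : node.srcWikitext))
    .join('');
}
```

This is why a trivial edit (like the comment added in testing) should, in the ideal case, leave every other byte of the page untouched.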

What does this not mean?

* If you edit one of those 0.05% of pages in VE, the VE-Parsoid combination
  will break the page. NO!

  If you edit the broken part of the page, Parsoid will very likely
  normalize the broken wikitext to the non-erroneous form (break up nested
  links, move fostered content out of the table, drop duplicate transclusion
  parameters, etc.). In the odd case, it could cause a dirty diff that
  changes the semantics of those broken constructs.

* Parsoid's visual rendering is 99.95% identical to PHP parser
  rendering. NO!

  RT tests are focused on Parsoid's ability to support editing without
  introducing dirty diffs. Even though Parsoid might render a page
  differently than the default read view (and might even be incorrect),
  we are nevertheless able to RT it without breaking the wikitext.

  On the way to getting to 99.95% RT accuracy, we have improved and fixed
  several bugs in Parsoid's rendering. The rendering is also fairly close
  to the default read view (otherwise, VE editors would definitely
  complain).

  However, we haven't done sufficient testing to systematically identify
  rendering incompatibilities and quantify this. In the coming quarters,
  we are going to turn our attention to this problem. We have a visual
  diffing infrastructure to help us with this (we take screenshots of
  Parsoid's output and the default output and compare those images and find
  diffs). We'll have to tweak and fix our visual-diffing setup and then fix
  rendering problems we find.
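At its core, that comparison reduces to diffing two images pixel by pixel; a minimal sketch might look like this (a toy assuming two same-sized RGBA pixel buffers; the real setup takes browser screenshots and uses fuzzier, noise-tolerant matching):

```javascript
// Toy pixel-level comparison (assumes two equal-length RGBA byte arrays;
// real visual diffing works on screenshots and tolerates minor noise).
// Returns the fraction of pixels that differ in any channel.
function pixelDiffRatio(a, b) {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error('expected equal-sized RGBA buffers');
  }
  let differing = 0;
  for (let i = 0; i < a.length; i += 4) {
    if (a[i] !== b[i] || a[i + 1] !== b[i + 1] ||
        a[i + 2] !== b[i + 2] || a[i + 3] !== b[i + 3]) {
      differing += 1;
    }
  }
  return differing / (a.length / 4);
}
```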

* 100% roundtripping accuracy is within reach. NO!

  The reality is that there are a lot of pages out there that have various
  kinds of broken markup (mis-nested html tags, unmatched html tags,
  broken templates) in production. There are probably other edge case
  scenarios that trigger different behavior in Parsoid and the PHP parser.
  Because we go to great lengths in Parsoid to avoid dirty diffs, our
  selective serialization works 

Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved

2015-06-25 Thread David Gerard
On 25 June 2015 at 23:22, Subramanya Sastry ssas...@wikimedia.org wrote:

 On behalf of the parsing team, here is an update about Parsoid, the
 bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow,
 and Content Translation.

Excellent. How close are we to binning the PHP parser? (I realise
that's a way off, but grant me my dreams.)


- d.
