Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved
On Fri, Jun 26, 2015 at 11:52 AM, Subramanya Sastry ssas...@wikimedia.org wrote:
> On 06/25/2015 06:29 PM, David Gerard wrote:
>> On 25 June 2015 at 23:22, Subramanya Sastry ssas...@wikimedia.org wrote:
>>> On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.
>>
>> Excellent. How close are we to binning the PHP parser? (I realise that's a way off, but grant me my dreams.)
>
> The PHP parser used in production has 3 components: the preprocessor, the core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed via the MediaWiki API), so that part of the PHP parser will continue to be in operation. As noted in my update, we are working towards read views served by Parsoid HTML, which requires several ducks to be lined up in a row. When that happens everywhere, the core PHP parser and Tidy will no longer be used.

Do we have plans for avoiding code rot in the unused PHP parser code that would affect smaller third-party sites that don't use Parsoid?

--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved
On 06/29/2015 09:20 AM, Brad Jorsch (Anomie) wrote:
> On Fri, Jun 26, 2015 at 11:52 AM, Subramanya Sastry ssas...@wikimedia.org wrote:
>> The PHP parser used in production has 3 components: the preprocessor, the core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed via the MediaWiki API), so that part of the PHP parser will continue to be in operation. As noted in my update, we are working towards read views served by Parsoid HTML, which requires several ducks to be lined up in a row. When that happens everywhere, the core PHP parser and Tidy will no longer be used.
>
> Do we have plans for avoiding code rot in the unused PHP parser code that would affect smaller third-party sites that don't use Parsoid?

My response to your other email covers quite a bit of this. As far as I have observed, the PHP parser code has been quite stable for a while, and small third-party sites are unlikely to have complex requirements and are less likely to hit serious bugs. In any case, we'll make a good-faith effort to keep the PHP parser maintained and we'll fix critical and really high-priority bugs. But, simply by virtue of us being a small team with multiple responsibilities, we will prioritize reducing complexity in Parsoid over keeping the PHP parser maintained. In the long run, I think that is a better path to bringing the two systems together.

Subbu.
Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved
On 06/25/2015 06:29 PM, David Gerard wrote:
> On 25 June 2015 at 23:22, Subramanya Sastry ssas...@wikimedia.org wrote:
>> On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.
>
> Excellent. How close are we to binning the PHP parser? (I realise that's a way off, but grant me my dreams.)

The PHP parser used in production has 3 components: the preprocessor, the core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed via the MediaWiki API), so that part of the PHP parser will continue to be in operation. As noted in my update, we are working towards read views served by Parsoid HTML, which requires several ducks to be lined up in a row. When that happens everywhere, the core PHP parser and Tidy will no longer be used.

However, I imagine your question is not so much about the PHP parser ... but more about wikitext and templating. Since I don't want to go off on a tangent based on an assumption, maybe you can say more about what you had in mind when you asked about binning the PHP parser.

Subbu.
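[Editorial aside: the preprocessor dependency described above goes through the MediaWiki action API's expandtemplates module. The following is a minimal sketch of building such a request; the parameter set matches the public API docs, but the wiki URL, page title, and sample template are illustrative assumptions, not anything Parsoid ships.]

```javascript
// Sketch: build a request to the MediaWiki action API's expandtemplates
// module, which exposes the PHP preprocessor. The apiBase, title, and
// wikitext below are example values, not real Parsoid configuration.
function buildExpandTemplatesUrl(apiBase, wikitext, title) {
  const params = new URLSearchParams({
    action: 'expandtemplates',
    format: 'json',
    prop: 'wikitext', // ask for the fully expanded wikitext back
    title,            // page context, used for magic words like {{PAGENAME}}
    text: wikitext,   // the template-laden source to expand
  });
  return `${apiBase}?${params.toString()}`;
}

const url = buildExpandTemplatesUrl(
  'https://en.wikipedia.org/w/api.php',
  '{{echo|Hello}}',
  'Sandbox'
);
```

A client would then fetch this URL and read the expanded wikitext out of the JSON response before doing its own parsing.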
Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved
I didn't have anything in mind; evidently I was just vague on what the stuff in there is and does :-)

On 26 June 2015 at 16:52, Subramanya Sastry ssas...@wikimedia.org wrote:
> On 06/25/2015 06:29 PM, David Gerard wrote:
>> On 25 June 2015 at 23:22, Subramanya Sastry ssas...@wikimedia.org wrote:
>>> On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.
>>
>> Excellent. How close are we to binning the PHP parser? (I realise that's a way off, but grant me my dreams.)
>
> The PHP parser used in production has 3 components: the preprocessor, the core parser, and Tidy. Parsoid relies on the PHP preprocessor (accessed via the MediaWiki API), so that part of the PHP parser will continue to be in operation. As noted in my update, we are working towards read views served by Parsoid HTML, which requires several ducks to be lined up in a row. When that happens everywhere, the core PHP parser and Tidy will no longer be used.
>
> However, I imagine your question is not so much about the PHP parser ... but more about wikitext and templating. Since I don't want to go off on a tangent based on an assumption, maybe you can say more about what you had in mind when you asked about binning the PHP parser.
>
> Subbu.
[Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved
Hello everyone,

On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.

Subbu.

---

TL;DR:

1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing without introducing semantic diffs[2].

2. With trivial simulated edits, the HTML -> wikitext serializer used in production (selective serialization) introduces ZERO dirty diffs in 99.986% of those edits[3]. 10 of the 23 edits with dirty diffs are minor newline diffs.

---

A couple of days back (June 23rd), Parsoid achieved 99.95%[2] semantic accuracy in the wikitext -> HTML -> wikitext roundtripping process on a set of about 158K pages randomly picked from about 16 wikis back in 2013. Keeping this test set constant has let us monitor our progress over time. We were at 99.75% around this time last year.

What does this mean?

* Despite the practical complexities of wikitext, the mismatch between the processing models of wikitext (string-based) and Parsoid (DOM-based), and the various wikitext errors found on pages, Parsoid is able to maintain a reversible mapping between wikitext constructs and their equivalent HTML DOM trees that HTML editors and other tools can manipulate. The majority of the differences in the remaining 0.05% arise from wikitext errors: links in links, 'fosterable'[4] content in tables, and some scenarios with unmatched quotes in attributes. Parsoid does not support round-tripping (RT) of these.

* While this is not a big change from how things have been for about a year now in terms of Parsoid's support for editing, this is a notable milestone for us in terms of the confidence we have in Parsoid's ability to handle the wikitext usage seen in production wikis and our ability to RT it accurately without corrupting pages. This should also boost the confidence of all applications that rely on Parsoid.
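[Editorial aside: the roundtrip check behind these numbers can be sketched as below. This is a minimal illustration only, not Parsoid's actual diff logic; the normalize() rules here are invented for the example, whereas Parsoid's real syntactic-vs-semantic classification is far more involved.]

```javascript
// Sketch of classifying a wikitext -> HTML -> wikitext roundtrip result.
// A "syntactic" diff changes only inconsequential surface form; a
// "semantic" diff survives normalization and would alter the page.
function normalize(wikitext) {
  return wikitext
    .split('\n')
    .map((line) => line.replace(/[ \t]+$/, '')) // drop trailing whitespace
    .join('\n')
    .replace(/\n{3,}/g, '\n\n'); // collapse runs of blank lines
}

function classifyRoundtrip(original, roundtripped) {
  if (original === roundtripped) return 'clean';
  if (normalize(original) === normalize(roundtripped)) return 'syntactic-diff';
  return 'semantic-diff';
}
```

Only the third category counts against the 99.95% figure; byte-identical output and whitespace-only churn are both acceptable outcomes of a roundtrip.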
* In production, Parsoid uses a selective serialization strategy which preserves unedited parts of the wikitext as far as possible. As part of regular testing, we also simulate a trivial edit by adding a new comment to the page and run the edited HTML through this selective serializer. All but 23 pages (0.014% of trivial edits) had ZERO dirty diffs[3]. Of these 23, 10 of the diffs were minor newline diffs. In production, the dirty diff rate will be higher than 0.014% because of more complex edits and because of bugs in any of the 3 components involved in visual editing on Wikipedias (Parsoid, RESTBase[5], and Visual Editor) and their interaction. But the base accuracy of Parsoid's roundtripping (in terms of both full and selective serialization) is critical to ensuring clean visual edits. The above milestones are part of ensuring that.

What does this not mean?

* If you edit one of those 0.05% of pages in VE, the VE-Parsoid combination will break the page. NO! If you edit the broken part of the page, Parsoid will very likely normalize the broken wikitext to the non-erroneous form (break up nested links, move fostered content out of the table, drop duplicate transclusion parameters, etc.). In the odd case, it could cause a dirty diff that changes the semantics of those broken constructs.

* Parsoid's visual rendering is 99.95% identical to PHP parser rendering. NO! RT tests are focused on Parsoid's ability to support editing without introducing dirty diffs. Even though Parsoid might render a page differently from the default read view (and might even be incorrect), we are nevertheless able to RT it without breaking the wikitext. On the way to 99.95% RT accuracy, we have improved Parsoid's rendering and fixed several bugs in it. The rendering is also fairly close to the default read view (otherwise, VE editors would definitely complain). However, we haven't done sufficient testing to systematically identify rendering incompatibilities and quantify this.
In the coming quarters, we are going to turn our attention to this problem. We have a visual diffing infrastructure to help us with this (we take screenshots of Parsoid's output and the default output, compare those images, and find diffs). We'll have to tweak and fix our visual-diffing setup and then fix the rendering problems we find.

* 100% roundtripping accuracy is within reach. NO! The reality is that there are a lot of pages out there with various kinds of broken markup (mis-nested HTML tags, unmatched HTML tags, broken templates) in production. There are probably other edge-case scenarios that trigger different behavior in Parsoid and the PHP parser. Because we go to great lengths in Parsoid to avoid dirty diffs, our selective serialization works
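[Editorial aside: the selective serialization strategy described in the announcement above can be sketched as follows. This is an illustrative toy, not Parsoid's implementation: the flat node list, the `edited` flag, and the `dsr` offsets here are invented for the example, though they are in the spirit of Parsoid's DSR (DOM source range) annotations, which map DOM nodes back to byte ranges of the original wikitext.]

```javascript
// Sketch of selective serialization: unedited nodes keep their original
// wikitext substring via [start, end) source offsets; only edited nodes
// are re-serialized from the DOM. This is what keeps unedited parts of
// the page free of dirty diffs.
function selectiveSerialize(originalWikitext, nodes, serializeNode) {
  return nodes
    .map((node) =>
      node.edited
        ? serializeNode(node) // re-serialize edited content
        : originalWikitext.slice(node.dsr[0], node.dsr[1]) // reuse source
    )
    .join('');
}

// Toy example: the heading is untouched, the paragraph was edited.
const src = '== Heading ==\nOld paragraph.';
const nodes = [
  { dsr: [0, 14], edited: false },
  { dsr: [14, 28], edited: true, text: 'New paragraph.' },
];
const result = selectiveSerialize(src, nodes, (n) => n.text);
// result: '== Heading ==\nNew paragraph.'
```

The heading comes back byte-for-byte from the source, so any serializer quirks can only ever show up inside the edited range.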
Re: [Wikitech-l] Parsoid announcement: Main roundtrip quality target achieved
On 25 June 2015 at 23:22, Subramanya Sastry ssas...@wikimedia.org wrote:
> On behalf of the parsing team, here is an update about Parsoid, the bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow, and Content Translation.

Excellent. How close are we to binning the PHP parser? (I realise that's a way off, but grant me my dreams.)

- d.