Re: [Wikitech-l] Parsoid's progress
I believe Subbu will follow up with a more complete response, but I'll note that:

1) No plan survives first encounter with the enemy. Parsoid was going to be simpler than the PHP parser; Parsoid was going to be written in PHP, then C, then prototyped in JS for a later implementation in C, etc. It has varied over time as we learned more about the problem. It is currently written in node.js and is probably at least the same order of complexity as the existing PHP parser. It is, however, built on slightly more solid foundations, so its behavior is more regular than the PHP parser in many places -- although I've been submitting patches to the core parser where necessary to try to bring them closer together. (Cf. https://gerrit.wikimedia.org/r/180982 for the most recent of these.) And, of course, Parsoid emits well-formed HTML which can be round-tripped. In many cases Parsoid could be greatly simplified if we didn't have to maintain compatibility with various strange corner cases in the PHP parser.

2) Parsoid contains a partial implementation of the PHP expandtemplates module. It was decided (I think wisely) that we didn't really gain anything by trying to reimplement this on the Parsoid side, and that it was better to use the existing PHP code via api.php. The alternative would be to reimplement quite a lot of MediaWiki (Lua embedding, the various parser function extensions, etc.) in node.js. This *could* be done -- there is no technical reason why it cannot -- but nobody thinks it's a good idea to spend time on right now. The expandtemplates support basically works. As I said, it doesn't cover all the crazy extensions that we use on the main WMF sites, but it would be reasonable to turn it on for a smaller stock MediaWiki instance. In that sense Parsoid *could* be a full replacement for the parser.
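[To make the delegation described in point 2 concrete, here is a minimal sketch of the shape of a call to MediaWiki's expandtemplates api.php module. The action and parameter names are the real API interface; the wiki hostname and the {{echo|...}} template are placeholders, and a real client would perform the HTTP request rather than just building the URL. Runs on Node.js 10+ (global URLSearchParams).]

```javascript
// Sketch: delegating template expansion to MediaWiki's api.php,
// the way Parsoid does instead of reimplementing the preprocessor.
const params = new URLSearchParams({
  action: 'expandtemplates',     // real api.php module
  text: '{{echo|hello}}',        // wikitext to expand (placeholder template)
  prop: 'wikitext',              // return the expanded wikitext
  format: 'json',
});

// Placeholder wiki; any MediaWiki install exposing api.php would work.
const url = 'https://example.wikipedia.org/w/api.php?' + params.toString();
console.log(url);
```

A real caller would then fetch this URL and read `expandtemplates.wikitext` out of the JSON response before continuing with its own wikitext-to-HTML pass.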
But note that even as a full parser replacement, Parsoid depends on the PHP API in a large number of ways: imageinfo, siteinfo, language information, localized keywords for images, etc. The idea of independence is somewhat vague.

--scott

On Mon, Jan 19, 2015 at 11:58 PM, MZMcBride z...@mzmcbride.com wrote:

[quoted message snipped]

--
(http://cscott.net)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Parsoid's progress
Given what I've seen so far, it might be best to aim for a gradual reimplementation of Parsoid features so that most of them work without Parsoid, with the eventual goal of severing the dependency on Parsoid completely, if possible. At any rate, the less the parser has to outsource, the less complicated things will be, correct?

Date: Tue, 20 Jan 2015 11:02:10 -0500
From: canan...@wikimedia.org
To: wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] Parsoid's progress

[quoted message snipped]
Re: [Wikitech-l] Parsoid's progress
C. Scott Ananian wrote:

1) no plan survives first encounter with the enemy. Parsoid was going to be simpler than the PHP parser, Parsoid was going to be written in PHP, then C, then prototyped in JS for a later implementation in C, etc. It has varied over time as we learned more about the problem. It is currently written in node.js and probably is at least the same order of complexity as the existing PHP parser.

Hrm.

In many cases Parsoid could be greatly simplified if we didn't have to maintain compatibility with various strange corner cases in the PHP parser.

I guess this is the part that I'm still struggling with. If the PHP parser is/was already doing the job of converting wikitext to HTML, why would that need to be rewritten in Node.js? Wouldn't it have been simpler to make the HTML output more verbose in the PHP parser so that it could cleanly round-trip? I'm still not clear where Node.js (or C or JavaScript) came into this. I heard there were performance concerns with the PHP parser. Was that the case? I'm mostly just curious... you can't un-milk the cow, as they say.

But note that even as a full parser replacement Parsoid depends on the PHP API in a large number of ways: imageinfo, siteinfo, language information, localized keywords for images, etc. The idea of independence is somewhat vague.

Hrm.

MZMcBride
Re: [Wikitech-l] Parsoid's progress
Thank you both for the detailed replies. They were very helpful and I feel like I have a better understanding now. I'm still trying to wrap my head around Parsoid, its implementation, and how it fits in with the larger future of MediaWiki development.

Subramanya Sastry wrote:

The core parser has the following components:
* preprocessing that expands transclusions, extensions (including Scribunto), parser functions, include directives, etc. to wikitext
* a wikitext parser that converts wikitext to HTML
* Tidy, which runs on the HTML produced by the wikitext parser and fixes up malformed HTML

Parsoid right now replaces the last two of the three components, but in a way that enables all of the functionality stated earlier.

Are you saying Parsoid can act as a replacement for HTML Tidy? That seems like a pretty huge win. Replacing Tidy has been a longstanding goal: https://phabricator.wikimedia.org/T4542.

But, there are several directions this can go from here (including implementing a preprocessor in Parsoid, for example). However, note that this discussion is not entirely about Parsoid but also about shared hosting support, MediaWiki packaging, a pure-PHP MediaWiki install, HTML-only wikis, etc. All those other decisions inform what Parsoid should focus on and how it evolves.

I think this is very well put. There's definitely a lot to think about.

MZMcBride
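[A structural sketch, not real code: the three pipeline stages Subbu lists, composed in order. The stage bodies here are trivial stand-ins invented for illustration; the point is the composition, with Parsoid currently replacing the last two stages (parsing and tidying) while preprocessing still happens in PHP.]

```javascript
// Stage 1 (still PHP in practice): expand transclusions, parser
// functions, include directives, etc. -- stand-in implementation.
function preprocess(wikitext) {
  return wikitext.replace(/\{\{SITENAME\}\}/g, 'MyWiki');
}

// Stage 2 (replaced by Parsoid): wikitext -> HTML -- stand-in.
function parse(wikitext) {
  return '<p>' + wikitext + '</p>';
}

// Stage 3 (replaced by Parsoid): fix up malformed HTML -- stand-in no-op.
function tidy(html) {
  return html;
}

const render = (src) => tidy(parse(preprocess(src)));
console.log(render('Welcome to {{SITENAME}}')); // "<p>Welcome to MyWiki</p>"
```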
Re: [Wikitech-l] Parsoid's progress
Exactly: the HTML to wikitext conversion is what makes Parsoid useful, and not only for VE. Thanks to Parsoid, ContentTranslation has a simple rich text editor built on contenteditable (not a full VE, though this may change in the future). We are just starting to deploy it to production, but the users who tested it in beta labs loved it.

The Parsoid way of converting wikitext to HTML is useful, too, because it allows ContentTranslation to process the article being translated in a formal and predictable way, understanding where the links, images, templates, timelines, references, etc. are, and adapting them automatically for the translated article. All of this is done with simple jQuery selectors and very little effort.

On 20 Jan 2015 04:01, Matthew Flaschen mflasc...@wikimedia.org wrote:

[quoted message snipped]
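[An illustrative sketch, not ContentTranslation's actual code: what makes "simple jQuery selectors" sufficient is that Parsoid's output annotates constructs with stable attributes -- wikilinks carry rel="mw:WikiLink" and transclusions carry typeof="mw:Transclusion" in the Parsoid DOM spec. The sample HTML below is made up, and plain string counting stands in for jQuery to keep the sketch dependency-free.]

```javascript
// Made-up fragment in the shape of Parsoid output.
const parsoidHtml =
  '<p><a rel="mw:WikiLink" href="./Dog">dogs</a> and ' +
  '<a rel="mw:WikiLink" href="./Cat">cats</a>.</p>' +
  '<span typeof="mw:Transclusion" about="#mwt1">expanded template</span>';

// A client like ContentTranslation would do $('[rel="mw:WikiLink"]') etc.;
// here we just count the marker attributes directly.
const linkCount = (parsoidHtml.match(/rel="mw:WikiLink"/g) || []).length;
const tplCount = (parsoidHtml.match(/typeof="mw:Transclusion"/g) || []).length;
console.log(linkCount, tplCount); // 2 1
```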
Re: [Wikitech-l] Parsoid's progress
On 01/19/2015 08:15 AM, MZMcBride wrote:

Currently Parsoid is the largest client of the MediaWiki PHP parser, I'm told. If Parsoid is regularly calling and relying upon the MediaWiki PHP parser, what exactly is the point of Parsoid?

Parsoid can go: wikitext -> HTML -> wikitext. The MediaWiki parser can only go: wikitext -> HTML. The most important part of Parsoid is thus the HTML -> wikitext conversion (required for VisualEditor), but other parts of its architecture follow from that.

And from this question flows another: why is Parsoid calling MediaWiki's api.php so regularly?

I think it uses it for some aspects of templates and hooks. I'm sure the Parsoid team could explain further.

Matt Flaschen
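[A toy sketch of the asymmetry described above. Real Parsoid does vastly more; this handles only '''bold''' markup, but it shows the round-trip property (wikitext -> HTML -> wikitext returns the original) that the plain MediaWiki parser does not provide.]

```javascript
// wikitext -> HTML: the direction both parsers can do.
function toHtml(wikitext) {
  return wikitext.replace(/'''(.+?)'''/g, '<b>$1</b>');
}

// HTML -> wikitext: the direction only Parsoid adds (the serializer
// VisualEditor needs in order to save edits back as wikitext).
function toWikitext(html) {
  return html.replace(/<b>(.+?)<\/b>/g, "'''$1'''");
}

const src = "a '''bold''' claim";
console.log(toHtml(src));                     // "a <b>bold</b> claim"
console.log(toWikitext(toHtml(src)) === src); // true: round-trips cleanly
```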
Re: [Wikitech-l] Parsoid's progress
Matthew Flaschen wrote:

On 01/19/2015 08:15 AM, MZMcBride wrote: And from this question flows another: why is Parsoid calling MediaWiki's api.php so regularly?

I think it uses it for some aspects of templates and hooks. I'm sure the Parsoid team could explain further.

I've been discussing Parsoid a bit, and there's apparently an important distinction between the preprocessor(s) and the parser, though in practice I think "parser" is used pretty generically. Further notes follow.

I'm told that in Parsoid, ref and {{!}} are special-cased, while most other parser functions require using the expandtemplates module of MediaWiki's api.php. As I understand it, calling out to api.php is intended to be a permanent solution (I thought it might be a temporary shim). If the goal was just to add more verbose markup to parser output, couldn't we have done that in PHP? Node.js was chosen over PHP due to speed/performance considerations and concerns, from what I now understand.

The view that Parsoid is going to replace the PHP parser seems overly simplistic and goes back to the distinction between the parser and the preprocessor. Full wikitext transformation seems to require a preprocessor.

MZMcBride
Re: [Wikitech-l] Parsoid's progress
If I might weigh in, I concur with MZMcBride. If Parsoid is absolutely needed regardless, that's one thing. But if a VE editing interface can be set up that doesn't need Parsoid, that would reduce dependence on third-party software, make installation easier for all parties concerned, and be less resource-intensive. Since the optimization of resources is always a plus IMO, severing the dependency on Parsoid and attempting to do its current functions purely in house seems like a good plan to pursue.

Date: Mon, 19 Jan 2015 11:15:54 -0500
From: z...@mzmcbride.com
To: wikitech-l@lists.wikimedia.org
Subject: [Wikitech-l] Parsoid's progress

(Combining pieces of Jay's thread and pieces of the shared hosting thread.)

Daniel Friesen wrote: Parsoid can do Parsoid DOM to WikiText conversions. So I believe the suggestion is that storage be switched entirely to the Parsoid DOM, and WikiText in classic editing just becomes a method of editing the content that is stored as Parsoid DOM in the backend.

Tim Starling wrote: Parsoid depends on the MediaWiki parser; it calls it via api.php. It's not a complete, standalone implementation of wikitext to HTML transformation. HTML storage would be a pretty simple feature, and would allow third-party users to use VE without Parsoid. It's not so simple to use Parsoid without the MediaWiki parser, especially if you want to support all existing extensions. So, as currently proposed, HTML storage is actually a way to reduce the dependency on services for non-WMF wikis, not to increase it. Based on recent comments from Gabriel and Subbu, my understanding is that there are no plans to drop the MediaWiki parser at the moment.

Yeah... what is this all about? My understanding (and please correct me if I'm wrong) is that Parsoid is/was intended to be a standalone service capable of translating wikitext <-> HTML. You seem to be stating that Parsoid is neither complete nor standalone. Why?
Currently Parsoid is the largest client of the MediaWiki PHP parser, I'm told. If Parsoid is regularly calling and relying upon the MediaWiki PHP parser, what exactly is the point of Parsoid? How much parity is there between Parsoid without the MediaWiki parser and the MediaWiki parser itself? That is, if you selected a random sample of pages from a Wikimedia wiki, how many of them could Parsoid correctly parse on its own? And from this question flows another: why is Parsoid calling MediaWiki's api.php so regularly?

I'm also interested in Parsoid's development as it relates to the broader push for services. If Parsoid is going to be the model of future services development, I'd like a clearer evaluation of what kind of model it is. Again, please correct me if I'm wrong, mistaken, misinformed, etc., but from my place of limited knowledge, it sounds very unappealing to create large Node.js applications (services) that closely tie in to and require(!) PHP counterparts. This seems like the opposite of moving toward a more flexible, modular architecture. From my perspective, it would seem only to saddle us with additional technical debt going forward, as we double complexity indefinitely.

MZMcBride