Re: [Wikitech-l] Parsoid's progress

2015-01-20 Thread C. Scott Ananian
I believe Subbu will follow up with a more complete response, but I'll note
that:

1) no plan survives first encounter with the enemy.  Parsoid was going to
be simpler than the PHP parser, Parsoid was going to be written in PHP,
then C, then prototyped in JS for a later implementation in C, etc.  It has
varied over time as we learned more about the problem.  It is currently
written in node.js and probably is at least the same order of complexity as
the existing PHP parser.  It is, however, built on slightly more solid
foundations, so its behavior is more regular than the PHP parser in many
places -- although I've been submitting patches to the core parser where
necessary to try to bring them closer together.  (cf.
https://gerrit.wikimedia.org/r/180982 for the most recent of these.)  And,
of course, Parsoid emits well-formed HTML which can be round-tripped.

In many cases Parsoid could be greatly simplified if we didn't have to
maintain compatibility with various strange corner cases in the PHP parser.

2) Parsoid contains a partial implementation of the PHP expandtemplates
module.  It was decided (I think wisely) that we didn't really gain
anything by trying to reimplement this on the Parsoid side, though, and it
was better to use the existing PHP code via api.php.  The alternative would
be to basically reimplement quite a lot of mediawiki (lua embedding, the
various parser functions extensions, etc) in node.js.  This *could* be done
-- there is no technical reason why it cannot -- but nobody thinks it's a
good idea to spend time on right now.

But the expandtemplates stuff basically works.   As I said, it doesn't
contain all the crazy extensions that we use on the main WMF sites, but it
would be reasonable to turn it on for a smaller stock mediawiki instance.
In that sense it *could* be a full replacement for the Parser.
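For concreteness, here is a rough sketch (not Parsoid's actual code) of the kind of api.php request involved. The `action=expandtemplates` module and its `text`/`title`/`prop` parameters are MediaWiki's documented API; the helper name is invented for this illustration:

```javascript
// Build an api.php expandtemplates request of the sort Parsoid issues
// to let the PHP side handle template and parser-function expansion.
// Hypothetical helper; Parsoid's real request code differs.
function buildExpandTemplatesUrl(apiBase, wikitext, title) {
  const params = new URLSearchParams({
    action: 'expandtemplates',
    format: 'json',
    prop: 'wikitext',   // result property; available in MediaWiki 1.24+
    title: title,       // page context used during expansion
    text: wikitext,     // the wikitext to expand
  });
  return `${apiBase}?${params.toString()}`;
}

const url = buildExpandTemplatesUrl(
  'https://en.wikipedia.org/w/api.php',
  '{{echo|hi}}',
  'Sandbox'
);
console.log(url);
```

The point is that everything hard (Lua, parser-function extensions) stays behind that one HTTP boundary.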

But note that even as a full parser replacement Parsoid depends on the PHP
API in a large number of ways: imageinfo, siteinfo, language information,
localized keywords for images, etc.  The idea of independence is somewhat
vague.
  --scott


On Mon, Jan 19, 2015 at 11:58 PM, MZMcBride z...@mzmcbride.com wrote:

 Matthew Flaschen wrote:
 On 01/19/2015 08:15 AM, MZMcBride wrote:
  And from this question flows another: why is Parsoid
  calling MediaWiki's api.php so regularly?
 
 I think it uses it for some aspects of templates and hooks.  I'm sure
 the Parsoid team could explain further.

 I've been discussing Parsoid a bit and there's apparently an important
 distinction between the preprocessor(s) and the parser. Though in practice
 I think "parser" is used pretty generically. Further notes follow.

 I'm told in Parsoid, <ref> and {{!}} are special-cased, while most other
 parser functions require using the expandtemplates module of MediaWiki's
 api.php. As I understand it, calling out to api.php is intended to be a
 permanent solution (I thought it might be a temporary shim).

 If the goal was to just add more verbose markup to parser output, couldn't
 we just have done that (in PHP)? Node.js was chosen over PHP due to
 speed/performance considerations and concerns, from what I now understand.

 The view that Parsoid is going to replace the PHP parser seems to be
 overly simplistic and goes back to the distinction between the parser and
 preprocessor. Full wikitext transformation seems to require a preprocessor.

 MZMcBride







-- 
(http://cscott.net)
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Parsoid's progress

2015-01-20 Thread Arcane 21
Given what I've seen so far, it might be best to aim for a gradual
reimplementation of Parsoid's features so that most of them work without a
need for Parsoid, with the eventual goal of severing the dependency on Parsoid
completely if possible. At any rate, the less the parser has to outsource, the
less complicated things will be, correct?

 Date: Tue, 20 Jan 2015 11:02:10 -0500
 From: canan...@wikimedia.org
 To: wikitech-l@lists.wikimedia.org
 Subject: Re: [Wikitech-l] Parsoid's progress
 

Re: [Wikitech-l] Parsoid's progress

2015-01-20 Thread MZMcBride
C. Scott Ananian wrote:
1) no plan survives first encounter with the enemy.  Parsoid was going to
be simpler than the PHP parser, Parsoid was going to be written in PHP,
then C, then prototyped in JS for a later implementation in C, etc.  It
has varied over time as we learned more about the problem.  It is
currently written in node.js and probably is at least the same order of
complexity as the existing PHP parser.

Hrm.

In many cases Parsoid could be greatly simplified if we didn't have to
maintain compatibility with various strange corner cases in the PHP
parser.

I guess this is the part that I'm still struggling with. If the PHP parser
is/was already doing the job of converting wikitext to HTML, why would
that need to be rewritten in Node.js? Wouldn't it have been simpler to
make the HTML output more verbose in the PHP parser so that it could
cleanly round-trip? I'm still not clear where Node.js (or C or JavaScript)
came into this. I heard there were performance concerns with the PHP
parser. Was that the case?

I'm mostly just curious... you can't un-milk the cow, as they say.

But note that even as a full parser replacement Parsoid depends on the PHP
API in a large number of ways: imageinfo, siteinfo, language information,
localized keywords for images, etc.  The idea of independence is
somewhat vague.

Hrm.

MZMcBride




Re: [Wikitech-l] Parsoid's progress

2015-01-20 Thread MZMcBride
Thank you both for the detailed replies. They were very helpful and I feel
like I have a better understanding now. I'm still trying to wrap my head
around Parsoid, its implementation, and how it fits in with the larger
future of MediaWiki development.

Subramanya Sastry wrote:
The core parser has the following components:

* preprocessing that expands transclusions, extensions (including
  Scribunto), parser functions, include directives, etc. to wikitext
* wikitext parser that converts wikitext to html
* Tidy that runs on the html produced by wikitext parser and fixes up
  malformed html

Parsoid right now replaces the last two of the three components, but in
a way that enables all of the functionality stated earlier.

Are you saying Parsoid can act as a replacement for HTMLTidy? That seems
like a pretty huge win. Replacing Tidy has been a longstanding goal:
https://phabricator.wikimedia.org/T4542.
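To make the division of labor concrete, the three stages can be sketched as a simple pipeline. These are toy stand-in functions invented for this illustration, nothing like the real components:

```javascript
// Toy model of the core parser pipeline described above.
const preprocess = wt => wt.replace(/\{\{echo\|([^}]*)\}\}/g, '$1'); // expand transclusions, parser functions, ...
const parse = wt => `<p>${wt}</p>`;  // wikitext -> (possibly malformed) HTML
const tidy = html => html;           // fix up malformed HTML (Tidy's job)

const render = wt => tidy(parse(preprocess(wt)));
console.log(render('{{echo|Hello}}'));  // -> <p>Hello</p>
```

In these terms, Parsoid replaces `parse` and `tidy`, while `preprocess` stays on the PHP side behind api.php.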

But, there are several directions this can go from here (including
implementing a preprocessor in Parsoid, for example). However, note that
this discussion is not entirely about Parsoid but also about shared
hosting support, mediawiki packaging, pure PHP mediawiki install,
HTML-only wikis, etc. All those other decisions inform what Parsoid
should focus on and how it evolves.

I think this is very well put. There's definitely a lot to think about.

MZMcBride




Re: [Wikitech-l] Parsoid's progress

2015-01-19 Thread Amir E. Aharoni
Exactly: The HTML to wikitext conversion is what makes Parsoid useful, and
not only for VE.

Thanks to Parsoid, ContentTranslation has a simple rich text editor with
contenteditable (not a full VE, though this may change in the future). We
are just starting to deploy it to production, but the users who tested in
beta labs loved it.

The Parsoid way of converting wikitext to HTML is useful, too, because it
allows ContentTranslation to process the article that is being translated
in a formal and expected way, understanding where links, images,
templates, timelines, references, etc. are, and adapting it automatically to
the translated article. All of this is done with simple jQuery selectors
and very little effort.
On 20 January 2015 at 04:01, Matthew Flaschen mflasc...@wikimedia.org
wrote:

 On 01/19/2015 08:15 AM, MZMcBride wrote:

 Currently Parsoid is the largest client of the MediaWiki PHP parser, I'm
 told. If Parsoid is regularly calling and relying upon the MediaWiki PHP
 parser, what exactly is the point of Parsoid?


 Parsoid can go:

 wikitext → HTML → wikitext

 The MediaWiki parser can only go:

 wikitext → HTML

 The most important part of Parsoid is thus the HTML → wikitext conversion
 (required for VisualEditor), but other parts of their architecture follow
 from that.

  And from this question flows another: why is Parsoid
 calling MediaWiki's api.php so regularly?


 I think it uses it for some aspects of templates and hooks.  I'm sure the
 Parsoid team could explain further.

 Matt Flaschen


Re: [Wikitech-l] Parsoid's progress

2015-01-19 Thread Matthew Flaschen

On 01/19/2015 08:15 AM, MZMcBride wrote:

Currently Parsoid is the largest client of the MediaWiki PHP parser, I'm
told. If Parsoid is regularly calling and relying upon the MediaWiki PHP
parser, what exactly is the point of Parsoid?


Parsoid can go:

wikitext → HTML → wikitext

The MediaWiki parser can only go:

wikitext → HTML

The most important part of Parsoid is thus the HTML → wikitext
conversion (required for VisualEditor), but other parts of their 
architecture follow from that.
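In service terms, that bidirectionality shows up as two transform endpoints. A hedged sketch of the request URLs (the paths follow Parsoid's v3 HTTP API as I understand it; exact routes vary by version and deployment):

```javascript
// Build the two Parsoid transform URLs; no network I/O here.
function parsoidTransformUrl(base, domain, from, to) {
  return `${base}/${domain}/v3/transform/${from}/to/${to}`;
}

// wikitext -> HTML (a direction the PHP parser also offers)
console.log(parsoidTransformUrl('http://localhost:8000', 'en.wikipedia.org', 'wikitext', 'html'));
// HTML -> wikitext (the direction only Parsoid offers)
console.log(parsoidTransformUrl('http://localhost:8000', 'en.wikipedia.org', 'html', 'wikitext'));
```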



And from this question flows another: why is Parsoid
calling MediaWiki's api.php so regularly?


I think it uses it for some aspects of templates and hooks.  I'm sure 
the Parsoid team could explain further.


Matt Flaschen


Re: [Wikitech-l] Parsoid's progress

2015-01-19 Thread MZMcBride
Matthew Flaschen wrote:
On 01/19/2015 08:15 AM, MZMcBride wrote:
 And from this question flows another: why is Parsoid
 calling MediaWiki's api.php so regularly?

I think it uses it for some aspects of templates and hooks.  I'm sure
the Parsoid team could explain further.

I've been discussing Parsoid a bit and there's apparently an important
distinction between the preprocessor(s) and the parser. Though in practice
I think "parser" is used pretty generically. Further notes follow.

I'm told in Parsoid, <ref> and {{!}} are special-cased, while most other
parser functions require using the expandtemplates module of MediaWiki's
api.php. As I understand it, calling out to api.php is intended to be a
permanent solution (I thought it might be a temporary shim).
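For context, {{!}} expands to a literal pipe character, so template arguments can contain table syntax without the pipe being read as an argument separator. A toy expansion, purely for illustration and nothing like Parsoid's actual special-casing:

```javascript
// Toy stand-in for the {{!}} special case: replace the magic word with "|".
const expandBang = wt => wt.split('{{!}}').join('|');

console.log(expandBang('cell1 {{!}} cell2'));  // -> cell1 | cell2
```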

If the goal was to just add more verbose markup to parser output, couldn't
we just have done that (in PHP)? Node.js was chosen over PHP due to
speed/performance considerations and concerns, from what I now understand.

The view that Parsoid is going to replace the PHP parser seems to be
overly simplistic and goes back to the distinction between the parser and
preprocessor. Full wikitext transformation seems to require a preprocessor.

MZMcBride




Re: [Wikitech-l] Parsoid's progress

2015-01-19 Thread Arcane 21
If I might weigh in, I concur with MZMcBride. If Parsoid is absolutely needed 
regardless, that's one thing. But if a VE editing interface can be set up that 
doesn't need Parsoid, it would reduce dependence on third-party software, make 
installation easier for all parties concerned, and be less resource intensive. 
Since the optimization of resources is always a plus IMO, severing the 
dependency on Parsoid and attempting to do its current functions purely in 
house seems like a good plan to pursue.

 Date: Mon, 19 Jan 2015 11:15:54 -0500
 From: z...@mzmcbride.com
 To: wikitech-l@lists.wikimedia.org
 Subject: [Wikitech-l] Parsoid's progress
 
 (Combining pieces of Jay's thread and pieces of the shared hosting thread.)
 
 Daniel Friesen wrote:
 Parsoid can do Parsoid DOM to WikiText conversions. So I believe the
 suggestion is that storage be switched entirely to the Parsoid DOM and
 WikiText in classic editing just becomes a method of editing the content
 that is stored as Parsoid DOM in the backend.
 
 Tim Starling wrote:
  Parsoid depends on the MediaWiki parser; it calls it via api.php. It's
 not a complete, standalone implementation of wikitext to HTML
 transformation.
 
  
  HTML storage would be a pretty simple feature, and would allow
 third-party users to use VE without Parsoid. It's not so simple to use
 Parsoid without the MediaWiki parser, especially if you want to support
 all existing extensions.
 
  
  So, as currently proposed, HTML storage is actually a way to reduce the
 dependency on services for non-WMF wikis, not to increase it.
 
  Based on recent comments from Gabriel and Subbu, my understanding is
 that there are no plans to drop the MediaWiki parser at the moment.
 
 Yeah... what is this all about? My understanding (and please correct me if
 I'm wrong) is that Parsoid is/was intended to be a standalone service
 capable of translating wikitext ↔ HTML. You seem to be stating that
 Parsoid is neither complete nor standalone. Why?
 
 Currently Parsoid is the largest client of the MediaWiki PHP parser, I'm
 told. If Parsoid is regularly calling and relying upon the MediaWiki PHP
 parser, what exactly is the point of Parsoid?
 
 How much parity is there between Parsoid without the use of the MediaWiki
 parser and the MediaWiki parser? That is, if you selected a random sample
 of pages from a Wikimedia wiki, how many of them could Parsoid correctly
 parse on its own? And from this question flows another: why is Parsoid
 calling MediaWiki's api.php so regularly?
 
 I'm also interested in Parsoid's development as it relates to the broader
 push for services. If Parsoid is going to be the model of future services
 development, I'd like a clearer evaluation of what kind of model it is.
 
 Again, please correct me if I'm wrong, mistaken, misinformed, etc., but
 from my place of limited knowledge, it sounds very unappealing to create
 large Node.js applications (services) that closely tie in and require(!)
 PHP counterparts. This seems like the opposite of moving toward a more
 flexible, modular architecture. From my perspective, it would seem to only
 saddle us with additional technical debt moving forward, as we double
 complexity indefinitely.
 
 MZMcBride
 
 
 