Re: [Wikitech-l] I love Parsoid but it doesn't want me

Ricordisamoa Sat, 01 Aug 2015 00:39:59 -0700

On 01/08/2015 01:20, Subramanya Sastry wrote:

On 07/31/2015 12:55 PM, Ricordisamoa wrote:
Hi Subbu,
thank you for this thoughtful insight.
And thank you for starting this thread. :-)
HTML is not a barrier by itself. The problem seems to be Parsoid being built primarily with VisualEditor in mind.
While we want the DOM to be VE-friendly, we definitely don't want the DOM to be VE-centric, and that has been the intention from the very beginning. Flow and CX also use the Parsoid DOM for their functionality. There are other users too [1].

VE, Flow, CX all take advantage of HTML. And I can't make any sense out of editProtectedHelper.js <https://en.wikipedia.org/wiki/User:Jackmcbarn/editProtectedHelper.js> :'(

We definitely want Parsoid's output to be useful and usable more broadly as the canonical output representation of wikitext and are open to fixing whatever prevents that.
As Scott noted in the other email on the thread, inspired (and maybe challenged :-) ) by mwparserfromhell's utilities, he has already whipped up a layer that provides an easier interface for manipulating the DOM.
It is not clear to me how a single DOM serving both view and edit modes can avoid redundancy.
You are right that there are some redundancies in information representation (because of having to serve multiple needs), but as far as I know, they are mostly around image attributes. If there is anything else specific (beyond image attributes) that is bothering you, can you flag that?


https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Transclusion_content

All template parameters are in data-mw but not parsed. Parameters ending up in the 'final' wikitext are parsed separately.

I see huge demand for alternative wikignome-style editors. The more predictable, concise and documented Parsoid's DOM is, the more users you get.
I think Parsoid's DOM is predictable :-) but can you say more about what prompted you to say that?

For example, to find images I have to search for elements whose typeof is one of mw:Image, mw:Image/Thumb, mw:Image/Frame, mw:Image/Frameless, then see if it's a figure or a span, and expect either a <figcaption> or data-mw accordingly. Add that the img tag's parent can be <a> or <span>...
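
To make that lookup concrete, here is a minimal sketch using only Python's standard html.parser. The ImageFinder class and the SAMPLE markup are hypothetical and written for illustration; they follow only the typeof values and wrapper tags listed above and have not been checked against real Parsoid output.

```python
from html.parser import HTMLParser

# typeof values listed in the message above
IMAGE_TYPES = {"mw:Image", "mw:Image/Thumb", "mw:Image/Frame", "mw:Image/Frameless"}


class ImageFinder(HTMLParser):
    """Hypothetical helper: collect wrapper tag, typeof and img src per image."""

    def __init__(self):
        super().__init__()
        self.images = []      # finished records
        self._current = None  # record for the wrapper we are inside, if any
        self._depth = 0       # nesting depth of tags named like the wrapper

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            # typeof may hold several space-separated values
            types = set((attrs.get("typeof") or "").split()) & IMAGE_TYPES
            if tag in ("figure", "span") and types:
                self._current = {"wrapper": tag, "typeof": sorted(types)[0], "src": None}
                self._depth = 1
            return
        if tag == self._current["wrapper"]:
            self._depth += 1  # e.g. a <span> parent of <img> inside a <span> wrapper
        elif tag == "img" and self._current["src"] is None:
            self._current["src"] = attrs.get("src")

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current["wrapper"]:
            self._depth -= 1
            if self._depth == 0:
                self.images.append(self._current)
                self._current = None


# Made-up markup in the shape described above, not copied from Parsoid
SAMPLE = (
    '<figure typeof="mw:Image/Thumb"><a href="./File:A.jpg">'
    '<img src="//example.org/A.jpg"></a><figcaption>A</figcaption></figure>'
    '<p><span typeof="mw:Image"><span><img src="//example.org/B.png"></span></span></p>'
)

finder = ImageFinder()
finder.feed(SAMPLE)
finder.close()
for image in finder.images:
    print(image)
```

Even this small version has to track the wrapper tag and nesting depth by hand, which is the kind of bookkeeping I would rather not repeat in every tool.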

Instead, this is what I'd expect a proper structure to look like:

Image
.src = title, internal or external link?
.repository?
.page = number or null
.language = code or null
.format = thumb etc.
.caption = wikitext parsed recursively
.link = internal or external link or null
.size
 .original
  .width = 1234
  .height = 4321
 .specified
  .width = 2468
 .computed
  .width = 2468
  .height = 8642
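
The outline above could be sketched as plain data classes. The class names, field types and the scale_to_width helper are assumptions made for illustration (the outline leaves several types open), and none of this is an existing Parsoid API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dimensions:
    width: Optional[int] = None
    height: Optional[int] = None


@dataclass
class ImageSize:
    original: Dimensions   # size of the source file
    specified: Dimensions  # what the wikitext asked for (may be partial)
    computed: Dimensions   # what is actually rendered


@dataclass
class Image:
    src: str                          # title or link; plain string as a placeholder
    format: Optional[str]             # "thumb" etc.
    caption: Optional[str]            # placeholder for recursively parsed wikitext
    size: ImageSize
    repository: Optional[str] = None
    page: Optional[int] = None
    language: Optional[str] = None
    link: Optional[str] = None


def scale_to_width(original: Dimensions, width: int) -> Dimensions:
    """Hypothetical helper: scale to a requested width, keeping aspect ratio."""
    return Dimensions(width=width, height=round(original.height * width / original.width))


original = Dimensions(width=1234, height=4321)
specified = Dimensions(width=2468)
image = Image(
    src="File:Example.jpg",  # made-up title
    format="thumb",
    caption=None,
    size=ImageSize(original, specified, scale_to_width(original, specified.width)),
)
print(image.size.computed)
```

With the numbers from the outline, the computed size comes out as 2468 by 8642, matching the values given there.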

As for documentation, we document the DOM we generate and its semantics here [2].

It seems that some sections need updates, e.g. noinclude / includeonly / onlyinclude <https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#noinclude_.2F_includeonly_.2F_onlyinclude>

As for size, I just looked at the Barack Obama page and here are some size numbers.


By "concise" I meant an antonym for redundant, not lengthy :-)

1540407 /tmp/Barack_Obama.parsoid.html
1197318 /tmp/Barack_Obama.parsoid.no-data-mw.html
1045161 /tmp/Barack_Obama.php-parser.output.footer-stripped.html
Right now, because we inline template and other editable information (as inline JSON attributes of the DOM), it is a bit bulky. However, we have always had plans to move the data-mw attribute into its own bucket, which we might do at some point, in which case the size will be closer to the current PHP parser output. If we moved page properties and other metadata out, it would shrink a little bit more.
For views that don't need to support editing or any other manipulation or analyses, we can more aggressively strip more from the HTML without affecting the rendering


Stripping HTML altogether would be a huge step forward. :-)

and get close to or even shrink the size below the PHP parser output size (there might be use cases where that might be the appropriate thing to do). I could get this down to under 1M by stripping rel attributes, element ids, and about ids for identifying template output.
But, for editing (not just in VE) use cases, because of additional markup in place on the page (element ids, other markup for transclusions, extensions, links, etc.), the output will probably be somewhat larger than the corresponding PHP parser output. If we can keep it under 1.1x of PHP parser output size, I think we are good.
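
For reference, the three sizes quoted above can be checked against that 1.1x target with a few lines of arithmetic. This is only a sketch: the dictionary keys are made-up labels, and the numbers are assumed to be byte counts.

```python
# Sizes quoted in the thread for the Barack Obama page (assumed to be bytes)
sizes = {
    "parsoid": 1540407,
    "parsoid_no_data_mw": 1197318,
    "php_parser": 1045161,
}

TARGET = 1.1  # "under 1.1x of PHP parser output size"
baseline = sizes["php_parser"]

# Ratio of each output to the PHP parser output
ratios = {name: size / baseline for name, size in sizes.items()}

for name, ratio in ratios.items():
    status = "under" if ratio < TARGET else "over"
    print(f"{name}: {ratio:.3f}x ({status} the {TARGET}x target)")
```

On these numbers the full Parsoid output is roughly 1.47x the PHP parser output and the version without data-mw roughly 1.15x, so both are still over the target.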
I hope we can meet in the middle :-)
Please file bugs and continue to report things that get in the way of using Parsoid.
Subbu.

[1] https://www.mediawiki.org/wiki/Parsoid/Users
[2] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

