Joaquin,

Thanks for your reply.
Regarding the data-parsoid route, I can't reproduce the trouble I was having. I suspect I was just getting the /revision/tid part wrong.

Taking a step back, I think part of the problem was that I apparently had an incorrect mental model of how Parsoid works. I was envisioning something that took wikitext, parsed it into a semantic parse tree (kind of like mwparserfromhell does), and then converted that parse tree to HTML. What I was trying to get at was the intermediate parse tree. Looking at https://www.mediawiki.org/wiki/Parsoid/API, this appeared to be the pagebundle format, and I was groping around trying to find the API which exposed it. I looked at the /html routes and thought to myself, "No, that's not what I want. That's the HTML. I want the parse tree." So I was trying things like GET /:domain/v3/page/:format/:title/:revision? with :format set to "pagebundle". For example, I tried

    https://en.wikipedia.org/v3/page/pagebundle/banana

which 404's.

I think the biggest thing that could be done to improve the documentation is to update https://www.mediawiki.org/wiki/Parsoid/API. That's the page you get to most directly when searching for Parsoid documentation.

> On Sep 7, 2020, at 6:05 AM, Joaquin Oltra Hernandez <jhernan...@wikimedia.org> wrote:
>
> Hi Roy,
>
> Some responses inline:
>
> On Fri, Sep 4, 2020 at 6:41 PM Roy Smith <r...@panix.com> wrote:
>
>> I know there's been a ton of work done on Parsoid lately. This is great, and the amount of effort that's gone into this functionality is really appreciated. It's clear that Parsoid is the way of the future, but the documentation of how you get a Parsoid parse tree via an API call is kind of confusing.
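(A side note on the 404 I mentioned above, in case it helps whoever updates the docs: as far as I can tell, those /v3 routes are served by the standalone Parsoid service itself rather than by the wiki's own domain, which would explain why hitting en.wikipedia.org directly fails. Here's a sketch of the two URL shapes as I now understand them; the helper names and the service host are made up, and the "standalone service" reading is my assumption, not something the docs state.)

```javascript
// Sketch of the two URL shapes discussed in this thread.
// The /v3 pattern comes from the Parsoid/API page:
//   GET /:domain/v3/page/:format/:title/:revision?
// My assumption: it is served by the Parsoid service (serviceHost here
// is a placeholder), not by the wiki's main domain.
function v3PageUrl(serviceHost, domain, format, title, revision) {
  const path = [domain, 'v3', 'page', format, encodeURIComponent(title)];
  if (revision !== undefined) path.push(revision); // :revision? is optional
  return `https://${serviceHost}/${path.join('/')}`;
}

// The REST v1 route that does respond on en.wikipedia.org:
function restV1HtmlUrl(domain, title) {
  return `https://${domain}/api/rest_v1/page/html/${encodeURIComponent(title)}`;
}
```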
>> I found https://www.mediawiki.org/wiki/Parsoid/API, which looks like it's long out of date. The last edit was almost 2 years ago. As far as I can tell, most of what it says is obsolete, and refers to a series of /v3 routes which don't actually exist.
>
> This definitely looks outdated, I'll forward your email to the maintainers so maybe they can have a look and update it.
>
>> I also found https://en.wikipedia.org/api/rest_v1/#/Page%20content, which seems more in line with the current reality. But the call I was most interested in, /page/data-parsoid/{title}/{revision}/{tid}, doesn't actually respond (at least not on en.wikipedia.org).
>
> Maybe you can share exactly how you are querying the API and the responses you get, since this does seem to work fine for me (examples below). I think these APIs are the ones VisualEditor uses, so they should work appropriately.
>
> I tried querying https://en.wikipedia.org/api/rest_v1/page/html/Banana first, and got back the response. On it, you can get the revision and "tid" from the ETag header, like it says in the swagger docs:
>
>   ETag header indicating the revision and render timeuuid separated by a slash: "701384379/154d7bca-c264-11e5-8c2f-1b51b33b59fc". This ETag can be passed to the HTML save end point (as base_etag POST parameter), and can also be used to retrieve the exact corresponding data-parsoid metadata, by requesting the specific revision and tid indicated by the ETag.
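For my own notes, here's roughly how pulling the revision and tid out of that ETag looks in browser JavaScript. parseEtag is just a name I made up; the fetch line assumes the REST v1 route from your example.

```javascript
// Split a REST v1 ETag like "701384379/154d7bca-c264-11e5-8c2f-1b51b33b59fc"
// (possibly wrapped in quotes and a W/ weak-validator prefix) into its
// revision and tid parts. parseEtag is a hypothetical helper name.
function parseEtag(etag) {
  const bare = etag.replace(/^W\//, '').replace(/^"|"$/g, '');
  const slash = bare.indexOf('/');
  return {
    revision: bare.slice(0, slash),
    tid: bare.slice(slash + 1),
  };
}

// Browser usage sketch (not run here):
// const resp = await fetch('https://en.wikipedia.org/api/rest_v1/page/html/Banana');
// const { revision, tid } = parseEtag(resp.headers.get('etag'));
```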
> With that information, you can then compose the new API call URL:
>
> https://en.wikipedia.org/api/rest_v1/page/data-parsoid/Banana/975959204/7e3fb2f0-eb7b-11ea-bedb-95397ed6461a
>
> That should successfully respond with the metadata.
>
> I'm not 100% clear on the difference between the data-mw information on the /page/html response vs the one found on the /page/data-parsoid response, but anyhow you should be able to use both endpoints as needed that way.
>
>> Eventually, I discovered (see this thread <https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=976731421#Parsing_wikitext_in_javascript?>) that the way to get a Parsoid parse tree is via the https://en.wikipedia.org/api/rest_v1/page/html/ route, and digging the embedded JSON out of data-mw fragments scattered throughout the HTML. This seems counter-intuitive. And kind of awkward, since it's not even a full parse tree; it's just little snippets of parse trees, which I guess correspond to each template expansion?
>
> I looked around and found https://www.mediawiki.org/wiki/Specs/HTML/2.1.0 linked on the Parsoid page, which has extensive documentation on how wikitext <-> HTML is translated. It seems to be more actively maintained. Hopefully this can give you some insight into how the responses relate to the wikitext and how to find what you want.
>
>> So, taking a step backwards, my ultimate goal is to be able to parse the wikitext of a page and discover the template calls, with their arguments. On the server side, I'm doing this in Python with mwparserfromhell, which is fine. But now I need to do it on the client side, in browser-executed javascript.
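For what it's worth, here's roughly what digging template calls out of those data-mw fragments looks like, going by my reading of the template-markup section of the Specs/HTML page (each transclusion's data-mw has a parts array mixing plain strings with template objects carrying a target and params). The helper name and the browser snippet at the end are mine, not from the spec.

```javascript
// Extract template calls from a parsed data-mw object, per the
// Specs/HTML template-markup format: dataMw.parts is an array mixing
// plain strings and { template: { target, params } } objects.
// listTemplateCalls is a hypothetical helper name.
function listTemplateCalls(dataMw) {
  const calls = [];
  for (const part of dataMw.parts || []) {
    if (part && typeof part === 'object' && part.template) {
      const t = part.template;
      const params = {};
      for (const [name, val] of Object.entries(t.params || {})) {
        params[name] = val.wt; // each param value wraps its wikitext in .wt
      }
      calls.push({ name: t.target.wt, params });
    }
  }
  return calls;
}

// Browser usage sketch (not run here): scan the Parsoid HTML for
// elements marked typeof="mw:Transclusion" and parse their data-mw JSON.
// document.querySelectorAll('[typeof~="mw:Transclusion"]').forEach(el => {
//   console.log(listTemplateCalls(JSON.parse(el.getAttribute('data-mw'))));
// });
```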
>> I've looked at a few client-side libraries, but if Parsoid really is ready for prime time, it seems silly not to use it, and it's just a question of finding the right API calls.
>
> You may be interested in the #Template_markup section <https://www.mediawiki.org/wiki/Specs/HTML/2.1.0#Template_markup> from the previous spec, given your problem statement.
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud