On Android we could use Html.fromHtml() <http://developer.android.com/reference/android/text/Html.html#fromHtml(java.lang.String, android.text.Html.ImageGetter, android.text.Html.TagHandler)> to strip the HTML tags.
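A minimal plain-Java sketch of that idea. `Html.fromHtml(html).toString()` needs the Android runtime, so this stands alone with a naive character-scanning stripper that approximates its output for simple markup; the `TagStripper` class and method names are illustrative, not anything from the thread.

```java
// Naive tag stripping: drop anything between '<' and '>' and unescape a
// few common entities. On Android itself one would instead call
// Html.fromHtml(htmlDescription).toString().
public class TagStripper {
    public static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
            }
        }
        // Unescape entities; "&amp;" last so we don't double-unescape.
        return out.toString()
                  .replace("&lt;", "<")
                  .replace("&gt;", ">")
                  .replace("&quot;", "\"")
                  .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/wiki/Foo\">Foo</a> &amp; <b>bar</b>";
        System.out.println(strip(html)); // Foo & bar
    }
}
```

Note this throws away link targets entirely, which is exactly the information-loss trade-off discussed below.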
Bernd

On Mon, Dec 8, 2014 at 4:03 PM, Gergo Tisza <gti...@wikimedia.org> wrote:

> Hi Dan!
>
> On Mon, Dec 8, 2014 at 11:29 AM, Dan Garry <dga...@wikimedia.org> wrote:
>
>> *Background:* The Mobile Apps Team is working on a restyling of the way
>> the first fold of content is presented in the Wikipedia app. You can see
>> this image <http://i.imgur.com/dxqfJKd.png> to get an idea of what this
>> looks like.
>
> That looks awesome, can't wait to see it live! Any chance of something
> like this eventually hitting the desktop site? :-)
>
>> Having a high-resolution image so prominently at the top of the page
>> will likely drive a lot of clicks, so we're working on a lightweight
>> image viewer to deal with file pages, which are poorly styled
>> monstrosities on the mobile app. We're going to use the CommonsMetadata
>> API to help us out. :-)
>
> Keep in mind that there is no guarantee the API output is an accurate
> representation of the file page (lack of machine-readable template markup
> etc. - for example, CommonsMetadata can't figure out the license name for
> about 5% of the MediaViewer pageviews), so you'll still need a link to
> the raw file page somewhere.
>
>> *Problem:* The CommonsMetadata API can sometimes return HTML [1]. Having
>> HTML in the API response is a bit problematic for us. Native apps make
>> next to no use of HTML when creating links or layouts, so we have to
>> strip the HTML from every API response, lest it be displayed as
>> plaintext to the user. In the short term this is fine; we can strip it
>> and throw the information away. But in the long run it'd be better if
>> the API didn't return HTML.
>
> In the long run CommonsMetadata should die in a fire, together with the
> Commons paradigm of storing information in license parameters.
> You can see the related plans at Commons:Structured data
> <https://commons.wikimedia.org/wiki/Commons:Structured_data>; these
> include migrating most information to plaintext (file descriptions will
> probably remain rich text).
>
> In the not-so-long run, some HTML markup is fairly important. Links can
> be necessary for attribution, and paragraphs make long descriptions more
> readable; removing lists and tables makes some descriptions unreadable
> (map legends tend to use tables, for example). So I think the API would
> be much less useful if it started stripping HTML. (It does that already
> in a few cases where the intent is clear, such as stripping the enclosing
> <p> generated by MediaWiki, or stripping certain kinds of purely
> presentational markup such as creator templates
> <https://commons.wikimedia.org/wiki/Template:Creator>, but that only
> works when the source and intent of the markup is known.)
>
> We could add an API parameter to provide a plaintext version, but that
> would split the cache (both Varnish and memcached). Not a huge deal, but
> tag stripping is very easy, so if you don't need anything more specific
> than that, I would say it is simpler to do it on the client side. If more
> complex logic is needed (e.g. turning <ul>s into star lists), it makes
> sense to do that in the API instead of forcing each client to reimplement
> it, but I am not sure how generic such a text representation would be.
>
>> So, given that we can't do anything meaningful with the HTML in a native
>> app, we only have three options:
>>
>> - Display the raw HTML directly to the user
>> - Try to parse the HTML for interesting information and update the
>>   relevant view's properties using native code
>> - Strip any and all HTML tags that are given to us in the JSON
>>
>> The first two aren't sounding workable at all to me; the first is
>> unworkable from a product standpoint, and the second is an absolutely
>> gigantic can of worms.
>> So I guess we'll be stripping the HTML until such time as this is
>> fixed. :-)
>
> I'm not sure some limited HTML parsing is that bad. The low-hanging fruit
> is links (MediaViewer currently strips everything else, and most of the
> time that works decently), and those are never nested, so they can be
> processed by a trivial SAX parser, for which all platforms surely have
> libraries.
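The "trivial SAX parser" suggestion can be sketched with the JDK's own SAX support. This is a hedged illustration, not anything from the thread: it assumes the API fragment is well-formed enough to parse as XML once wrapped in a synthetic root element (a real client would want a lenient HTML parser such as jsoup), and the `LinkExtractor` name is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Collect href attributes of <a> elements from an HTML-ish fragment,
// ignoring all other markup. Since links are never nested, a flat SAX
// callback is enough; no tree needs to be built.
public class LinkExtractor {
    public static List<String> extractHrefs(String htmlFragment) {
        final List<String> hrefs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if ("a".equalsIgnoreCase(qName)) {
                    String href = atts.getValue("href");
                    if (href != null) hrefs.add(href);
                }
            }
        };
        try {
            // Wrap in a synthetic root so a bare fragment parses as XML.
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader("<root>" + htmlFragment + "</root>")),
                handler);
        } catch (Exception e) {
            // Malformed markup: return whatever was collected so far.
        }
        return hrefs;
    }

    public static void main(String[] args) {
        System.out.println(extractHrefs(
            "See <a href=\"/wiki/Foo\">Foo</a> and <a href=\"/wiki/Bar\">Bar</a>."));
        // [/wiki/Foo, /wiki/Bar]
    }
}
```

Combined with tag stripping for everything else, this keeps the one piece of markup that matters for attribution.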
_______________________________________________
Mobile-l mailing list
Mobile-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mobile-l