On Android we could use Html.fromHtml()
<http://developer.android.com/reference/android/text/Html.html#fromHtml(java.lang.String, android.text.Html.ImageGetter, android.text.Html.TagHandler)>
to strip the HTML tags.
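
Something like this, roughly (an untested sketch; the helper class name is
made up):

    import android.text.Html;

    // Parse the markup with Html.fromHtml() and flatten the resulting
    // Spanned back into plain text with toString().
    public final class HtmlStripper {
        private HtmlStripper() {}

        public static String stripTags(String html) {
            return Html.fromHtml(html).toString().trim();
        }
    }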

Bernd

On Mon, Dec 8, 2014 at 4:03 PM, Gergo Tisza <gti...@wikimedia.org> wrote:

> Hi Dan!
>
> On Mon, Dec 8, 2014 at 11:29 AM, Dan Garry <dga...@wikimedia.org> wrote:
>
>> *Background:* The Mobile Apps Team is working on a restyling of the way
>> the first fold of content is presented in the Wikipedia app. You can see
>> what this looks like in this image <http://i.imgur.com/dxqfJKd.png>.
>>
>
> That looks awesome, can't wait to see it live! Any chance of something
> like this eventually hitting the desktop site? :-)
>
> Having a high-resolution image so prominently at the top of the page will
>> likely drive a lot of clicks, so we're working on a lightweight image
>> viewer to deal with file pages, which are poorly styled monstrosities on
>> the mobile app. We're going to use the CommonsMetadata API to help us out.
>> :-)
>>
>
> Keep in mind that there is no guarantee the API output is an accurate
> representation of the file page (due to a lack of machine-readable template
> markup, etc.; for example, CommonsMetadata can't figure out the license name
> for about 5% of MediaViewer pageviews), so you'll still need a link to the
> raw file page somewhere.
>
> *Problem:* The CommonsMetadata API can sometimes return HTML [1]. Having
>> HTML in the API response is a bit problematic for us. Native apps make next
>> to no use of HTML when creating links or layouts, so we have to strip the
>> HTML from every API response, lest it be displayed as plaintext to the
>> user. In the short term this is fine: we can strip it and throw the
>> information away. But in the long run it'd be better if the API didn't
>> return HTML.
>>
>
> In the long run CommonsMetadata should die in a fire, together with the
> Commons paradigm of storing information in license parameters.
> You can see the related plans at Commons:Structured data
> <https://commons.wikimedia.org/wiki/Commons:Structured_data>; these
> include migrating most information to plaintext (file descriptions will
> probably remain rich text).
>
> In the not-so-long run, some HTML markup is fairly important. Links can be
> necessary for attribution, and paragraphs for making long descriptions more
> readable; removing lists and tables makes some descriptions unreadable (map
> legends tend to use tables, for example). So I think the API would be much
> less useful if it started stripping HTML. (It does that already in a few
> cases where the intent is clear, such as stripping the enclosing <p>
> generated by MediaWiki, or stripping certain kinds of purely presentational
> markup such as creator templates
> <https://commons.wikimedia.org/wiki/Template:Creator>, but that only
> works when the source and intent of the markup are known.)
>
> We could add an API parameter to provide a plaintext version, but that
> would split the cache (both Varnish and memcached). Not a huge deal, but
> tag stripping is very easy, so if you don't need anything more specific
> than that, I would say it is simpler to do it on the client side. If more
> complex logic is needed (e.g. turning <ul>s into star lists), it makes
> sense to do that in the API instead of forcing each client to reimplement
> it, but I am not sure how generic such a text representation would be.
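>
> A minimal sketch of the kind of client-side handling I mean (regex-based
> and purely illustrative; it will break on nested or unusual markup, so a
> real client should prefer a proper parser):
>
>     // Hypothetical helper, not part of any existing API: turn <ul>/<li>
>     // markup into a "star list", then strip whatever tags remain.
>     static String toStarList(String html) {
>         return html
>             .replaceAll("(?i)<li[^>]*>", "\n* ")        // each list item becomes a "* " line
>             .replaceAll("(?i)</?(ul|ol|li)[^>]*>", "")  // drop the list wrappers
>             .replaceAll("<[^>]+>", "")                  // strip any remaining tags
>             .trim();
>     }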
>
>> So, given that we can't do anything meaningful with the HTML in a native
>> app, we only have three options:
>>
>>    - Display the raw HTML directly to the user
>>
>>    - Try to parse the HTML for interesting information and update the
>>    relevant view's properties using native code
>>
>>    - Strip any and all HTML tags that are given to us in the JSON
>>
>> The first two don't sound workable at all to me; the first is unworkable
>> from a product standpoint, and the second is an absolutely gigantic can
>> of worms. So I guess we'll be stripping the HTML until such time as this
>> is fixed. :-)
>
>
> I'm not sure some limited HTML parsing is that bad. The low-hanging fruit
> is links (MediaViewer currently strips everything else, and most of the
> time that works decently), and those are never nested, so they can be
> processed by a trivial SAX parser, for which all platforms surely have
> libraries.
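>
> For example, something along these lines (an illustrative sketch only; it
> assumes the fragment is well-formed enough for an XML parser to accept, and
> the class name and sample string are made up):
>
>     import java.io.StringReader;
>     import javax.xml.parsers.SAXParserFactory;
>     import org.xml.sax.Attributes;
>     import org.xml.sax.InputSource;
>     import org.xml.sax.helpers.DefaultHandler;
>
>     // Print each link's text and target from a small HTML fragment.
>     public class LinkExtractor {
>         public static void main(String[] args) throws Exception {
>             String html =
>                 "Photo by <a href=\"https://example.org/author\">Some Author</a>";
>
>             DefaultHandler handler = new DefaultHandler() {
>                 private final StringBuilder text = new StringBuilder();
>                 private String href;
>
>                 @Override
>                 public void startElement(String uri, String local, String name,
>                                          Attributes attrs) {
>                     if ("a".equals(name)) {
>                         href = attrs.getValue("href");
>                         text.setLength(0);  // start collecting the link text
>                     }
>                 }
>
>                 @Override
>                 public void characters(char[] ch, int start, int len) {
>                     text.append(ch, start, len);
>                 }
>
>                 @Override
>                 public void endElement(String uri, String local, String name) {
>                     if ("a".equals(name)) {
>                         System.out.println(text + " -> " + href);
>                     }
>                 }
>             };
>
>             // Wrap the fragment in a dummy root so the parser accepts it.
>             SAXParserFactory.newInstance().newSAXParser().parse(
>                 new InputSource(new StringReader("<root>" + html + "</root>")),
>                 handler);
>         }
>     }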
>
_______________________________________________
Mobile-l mailing list
Mobile-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mobile-l
