Re: [Wikitech-l] RDFa and Microdata in MediaWiki

Aryeh Gregor Sat, 16 Jan 2010 17:26:43 -0800

On Sat, Jan 16, 2010 at 12:32 AM, Manu Sporny <mspo...@digitalbazaar.com> wrote:
> I don't know if you intended the tone of
> your e-mail in the way that I read it, but it came off as purposefully
> misleading based on the discussions that both you and I have had as
> members of the HTMLWG and WHATWG.

I do not claim to be an expert on RDFa, Microdata, or any similar
technology.  I'd prefer not to have to make a decision here at all,
and I've said so.  However, it looks like we (MediaWiki) have good
reason to use something or other.  For the reasons I gave, I think we
should choose whatever we believe is more likely to succeed, and
failing that, whatever we think is better (e.g., on grounds of
aesthetics or intuitiveness).  The example markup I gave might not be
ideal or accurate, but it serves to give a general idea of what the
markup looks like in each case, at least.  Thank you for your better
RDFa examples -- although it's telling that I was able to get
Microdata right on the first try, but apparently it took an RDFa
expert to figure out the correct RDFa.

However, as a Wikimedian, I'd like to point you to one of our core
guiding principles:

http://en.wikipedia.org/wiki/Wikipedia:Assume_good_faith

> One lesson that we learned during implementation of RDFa in Drupal is
> that it is helpful for CMS designers to pre-define vocabularies that are
> usable with their CMS systems if manual markup is necessary. Most markup
> of both Microdata and RDFa should also be left to the CMS code unless
> there is a very good reason to not do so.

The major use case for us is image licensing on Commons.  Currently
the license templates are written "by hand", in the sense that they're
not hardcoded in the software, but they're maintained by technically
advanced community members, so ordinary users don't see the markup.
To use my example image, look at this page:

http://commons.wikimedia.org/wiki/File:EmeryMolyneux-terrestrialglobe-1592-20061127.jpg

You can see the wikitext source of the page by hitting "view source"
(or "edit" if it's unprotected by the time you read this) at the top.
The license info is generated by:

{{cc-by-2.0}}

This expands to:

<table cellspacing="8" cellpadding="0" style="width:100%; clear:both;
text-align:center; margin:0.5em auto; background-color:#f9f9f9;
border:2px solid #e0e0e0; direction: ltr;" class="layouttemplate">
<tr>
<td style="width:90px;" rowspan="3"><img alt="w:en:Creative Commons"
src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/79/CC_some_rights_reserved.svg/90px-CC_some_rights_reserved.svg.png"
width="90" height="36" /><br />
<img alt="attribution"
src="http://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Cc-by_new_white.svg/24px-Cc-by_new_white.svg.png"
width="24" height="24" /></td>
<td>This file is licensed under the <a
href="http://en.wikipedia.org/wiki/en:Creative_Commons" class="extiw"
title="w:en:Creative Commons">Creative Commons</a> <a
href="http://creativecommons.org/licenses/by/2.0/deed.en"
class="external text" rel="nofollow">Attribution 2.0 Generic</a>
license.</td>
<td style="width:90px;" rowspan="3"></td>
</tr>
<tr style="text-align:center;">
<td></td>
</tr>
<tr style="text-align:left;">
<td>
<dl>
<dd>You are free:
<ul>
<li><b>to share</b> – to copy, distribute and transmit the work</li>
<li><b>to remix</b> – to adapt the work</li>
</ul>
</dd>
<dd>Under the following conditions:
<ul>
<li><b>attribution</b> – You must attribute the work in the manner
specified by the author or licensor (but not in any way that suggests
that they endorse you or your use of the work).</li>
</ul>
</dd>
</dl>
</td>
</tr>
</table>

(Not cutting-edge markup, but oh well.)  This is generated by the
contents of <http://commons.wikimedia.org/wiki/Template:Cc-by-2.0>,
which was created by the Commons community.  Pretty much all
boilerplate on Wikimedia projects is created by such templates.  So
when we enable Microdata and/or RDFa in MediaWiki wikitext, I'd expect
it to be used almost exclusively in templates, with few users actually
being directly exposed to it.  Since the content of MediaWiki pages
has no structure other than wikitext, basically we have to allow this
in wikitext to make it useful to mark up content.

I'll emphasize from the start that I do *not* think either RDFa or
microdata is suitable for dbpedia.org-style content.  There's no
reason we should put that in the HTML output, where it will take up
tons of space and not be useful to HTML consumers (e.g., browsers and
search engines).  That sort of data should be made available in a
separate stream for consumers who want it, in a dedicated format like
RDF.  That way HTML consumers aren't forced to download loads of
useless metadata, and metadata consumers aren't forced to download
loads of useless (and expensive-to-generate) HTML.  RDFa/Microdata
should *only* be used for metadata that's useful to HTML consumers of
some kind.

> If you want to allow manual markup of RDFa, MediaWiki should probably
> pre-define at least Dublin Core (used to describe creative works), FOAF
> (used to describe people and organizations), and Creative Commons (used
> to describe licenses).

I expect that we'd allow contributors to use whatever vocabularies
they'd like.  It's a wiki, after all.  :)

> The above could be marked up in RDFa, with pre-defined vocabs, like so:
>
> <p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
>   typeof="dctype:StillImage">
> <span property="dc:title">Emery Molyneux Terrestrial Globe</span>
> by <a rel="cc:attributionUrl" href="http://example.org/bob/"
>      property="cc:attributionName">Bob Smith</span>
> is licensed under a <a rel="license"
> href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
> Commons Attribution-Share Alike 3.0 United States License</a>.</p>
>
> . . .
>
> So, four pieces of data, which is pretty good considering the
> compactness of the HTML code. The Microdata looks like this:
>
> <div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
> ...
> <img
> src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
> width="640" height="480" itemprop="work">
> ...
> <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span>
> by <span itemprop="author">Bob Smith</span> is licensed under a <a
> itemprop="license"
> href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
> Commons Attribution-Share Alike 3.0 United States License</a>.</p>
> </div>
>
> The compactness of the markup between Microdata and RDFa is more or less
> the same in this particular example.

You're comparing apples to oranges here: you included the div and img
for Microdata but not RDFa.  If you include that for RDFa, and also
count the xmlns:, it becomes (correct me if I'm wrong)

[[
<div id="bodyContent">
...
<img 
src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480">
...
<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
typeof="dctype:StillImage"><span
xmlns:dc="http://purl.org/dc/elements/1.1/" property="dc:title">Emery
Molyneux Terrestrial Globe</span>
by <a xmlns:cc="http://creativecommons.org/ns#"
rel="cc:attributionUrl" href="http://example.org/bob/"
property="cc:attributionName">Bob Smith</span> is licensed under a <a
rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
]]

You do have to count the xmlns: somewhere.  Even if you put them on
the <html>, they still count at least once, and in this case they're
only used once on the page, so they deserve to count in full.  This is
685 characters.  On the other hand, Microdata:

[[
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
...
<img 
src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480" itemprop="work">
...
<p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by
<span itemprop="author">Bob Smith</span> is licensed under a <a
itemprop="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>
]]

525 characters.  Compare to the original with no extra semantics:

[[
<div id="bodyContent">
...
<img 
src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480">
...
<p>Emery Molyneux Terrestrial Globe by Bob Smith is licensed under a
<a href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>
]]

380 characters.  So Microdata adds 145 bytes, while RDFa adds 305: 2.1
times as much extra markup.  To be fair, you included an extra link to
http://example.org/bob/ which wasn't in the original example, but RDFa
is still about twice as many bytes.
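
To make the arithmetic above explicit, here is a small Python sketch.
It only uses the three character counts quoted in this message; the
names PLAIN, MICRODATA, RDFA and overhead() are made up for the
illustration.

```python
# Character counts quoted above for the three versions of the snippet.
PLAIN = 380       # no extra semantics
MICRODATA = 525   # with Microdata attributes
RDFA = 685        # with RDFa attributes and xmlns: declarations


def overhead(marked_up: int, plain: int = PLAIN) -> int:
    """Extra characters added on top of the unannotated markup."""
    return marked_up - plain


microdata_extra = overhead(MICRODATA)
rdfa_extra = overhead(RDFA)
ratio = rdfa_extra / microdata_extra

print(microdata_extra, rdfa_extra, round(ratio, 1))  # prints: 145 305 2.1
```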

It's not just bytes, though.  It's also complexity.  The Microdata is
*obvious*.  I've never used Microdata before in my life, or RDFa, but
somehow I got the Microdata right on the first try, while making
several errors in the RDFa.  It's not at all obvious what those xmlns:
things do, or what those cryptic prefixes mean.  Microdata is simpler
to understand at first glance for people from an HTML background.
Since you've been working with RDF for years, the magnitude of the
difference is probably not apparent to you.

> Getting Microdata and RDFa markup correct is easier if there are
> templates or if the semantic markup is performed automatically by the
> CMS based on a pre-defined form. For example,
> http://en.wikipedia.org/wiki/Augustus, note the Infobox on the
> right. It would be much better for the RDFa markup to happen
> automatically via MediaWiki's template process, than for it to be marked
> up by hand.

As I noted, the templates are made by hand, by each community.  The
software just gives the ability to include one page in another with
simple substitutions made.  The infobox on the Augustus article is
<http://en.wikipedia.org/wiki/Template:Infobox_royalty>, invoked like
so:

{{Infobox royalty
| name            = Caesar Augustus
| title           = [[Roman Emperor|Emperor]] of the [[Roman Empire]]
. . . snip 18 lines . . .
| place of death  = [[Nola]], [[Italia (Roman Empire)|Italia]], [[Roman Empire]]
| place of burial = [[Mausoleum of Augustus]], Rome
|}}

The template authors would be the ones to add semantics here, not the
software developers.  There are a couple orders of magnitude more wiki
editors than software developers, so it just wouldn't be practical for
the developers to be the ones to assign semantic markup to each and
every template.  Moreover, as you can tell from the HTML output of the
templates, template editors tend to be of the "copy-paste stuff until
it works" school of HTML authorship.  So you cannot argue here that
RDFa is just as good if we abstract away the actual markup.  We aren't
in a position to do that -- users with little to no knowledge of RDFa
or microdata will be editing the raw markup, and that has to be taken
into account.

> Intentional or not, Aryeh has painted RDFa in a negative light by not
> outlining a number of points related to adoption and both RDFa and
> Microdata's current status in the HTML Working Group. Adopting either
> RDFa or Microdata in an HTML5 document type would be premature
> at this time because both have not progressed past the Editors Draft
> stage yet. Either is subject to change as far as HTML5 is concerned
> and we really don't want you to ship HTML5 features before they've had
> a chance to solidify a bit more.
>
> However - XHTML1+RDFa is a published W3C Recommendation and it is safe
> to use it for deployment.

Microdata is also safe to use for deployment.  Like other web
technologies maintained by the WHATWG, it will not change once it's
widely adopted, and Wikipedia adoption would probably count as wide
adoption by itself.  Note that microdata, like all of HTML5, is at
Last Call at the WHATWG, independent of its status as Working Draft in
the W3C.

I've asked Hixie how stable Microdata is.  Since he's the sole person
who decides on changes to HTML5 at the WHATWG, as you know, his answer
should be authoritative.

> Google[1] is actively indexing RDFa today as
> is Yahoo[2]. Sites such as Digg, Whitehouse.gov, the UK Government, The
> Public Library of Science, and O'Reilly are
> high-profile sites that publish their pages using RDFa. Data formats
> such as XHTML1, SVG 1.2 and ODF have integrated RDFa as a core part of
> their language. Best Buy saw a 30% traffic increase after publishing
> their pages in RDFa using the GoodRelations vocabulary. I'm sure
> everyone here is aware of dbpedia.org[3] and Freebase[4] - which use RDF
> as a semantic representation format. dbpedia, which gets its data from
> Wikipedia, shows 479 million triples available - so that
> should give you folks some idea of the treasure trove of immediately
> extractable semantic data we're talking about.
>
> Make no mistake - RDFa has very strong deployment at this point and it
> will continue to grow past 100,000+ sites with the upcoming release of
> Drupal 7.

Right -- because microdata is so new.  How many of those groups
actually considered using microdata?  I'd guess roughly none, because
in most cases, microdata either didn't exist or was barely known.  If
microdata is much more intuitive and simpler to use, I'd expect it to
win in the long run, say five years from now.  RDFa isn't so widely
used that it can't be easily defeated by a clearly superior
technology.

On Sat, Jan 16, 2010 at 6:37 AM, Philip Jägenstedt <phi...@foolip.org> wrote:
> Is Wikipedia using XHTML served as application/xml+xhtml?

No.  We're currently using XHTML1.0 served as text/html.  I expect us
to switch to HTML5 served as text/html (which happens to also be
well-formed XML) before we deploy support for either microdata or
RDFa.

On Sat, Jan 16, 2010 at 5:16 PM, Manu Sporny <mspo...@digitalbazaar.com> wrote:
> You would do this in RDFa:
>
> <div about="#light">
> The speed of light is <span property="measure:speed"
> datatype="measure:meters-per-second">299792458</span> m/s.
> </div>
>
> which would generate the following triple:
>
> <#light>
>   measure:speed
>      "299792458"^^measure:meters-per-second .
>
> AFAIK, there is no way to do the equivalent in Microdata, is there Philip?

You could define different properties for different units, or allow
the data to include unit info directly.  Like

<span itemprop="speed">299792458 m/s</span>

and have the format itself define what "m/s" means.  I don't see this
as a practical issue in MediaWiki, given our use-cases (in particular,
emphatically excluding markup of data that's useless to typical HTML
consumers).
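
As a rough sketch of what "have the format itself define what m/s
means" could look like on the consumer side, the Python below parses
such a property value into a number and a unit.  Everything in it is
an assumption for illustration: the KNOWN_UNITS table, the regular
expression, and the parse_speed() helper are not taken from any
existing Microdata vocabulary.

```python
import re
from typing import NamedTuple


class Quantity(NamedTuple):
    value: float
    unit: str


# Hypothetical rule for a "speed" property: a number, whitespace, a unit.
_QUANTITY = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s+(\S+)\s*$")

# Hypothetical table of accepted units, as factors to metres per second.
KNOWN_UNITS = {"m/s": 1.0, "km/s": 1000.0}


def parse_speed(text: str) -> Quantity:
    """Parse text such as '299792458 m/s' and normalise it to m/s."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"not a quantity: {text!r}")
    number, unit = match.groups()
    if unit not in KNOWN_UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return Quantity(float(number) * KNOWN_UNITS[unit], "m/s")


print(parse_speed("299792458 m/s"))
```

A value that doesn't fit the rule, such as free text with no number,
raises ValueError, so the consumer can reject it without needing a
separate datatype attribute in the markup.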

> An RDF reasoner would know that not only is the data not typed, but even
> if it were typed, the value "fast enough to hurt" is not valid.

A microdata standard would also define what type of data is valid.
For instance, from the license vocabulary: "The value must be an
absolute URL."  "The value must be either an item with the type
http://microformats.org/profile/hcard, or text."

> What happens when an author forgets to include itemtype?

The same as if an author forgets to include xmlns:.  It's not tied to
any vocabulary, so you have to either guess or ignore it.  It's not
ambiguous, it's just meaningless.  There's no difference to RDFa here,
except that RDFa encourages you to link to the profile IDs on the
<html> element, which is much more likely to break under copy-paste.

> RDFa is built on a concept called "follow your nose", which means that
> all vocabulary term URLs in RDFa, such as
> http://purl.org/media/audio#Recording, should be dereference-able and at
> the end of that URL should be a machine-readable description of the
> vocabulary term. Preferably, a human-readable description should also
> exist at that URL.

The perils of using URLs like this are well-known.  Just ask the W3C
how many hits it gets for DTDs every second.  Microdata deliberately
and wisely avoids using URLs that machines are intended to
dereference.  On the other hand, humans can find the info easily:

http://www.google.com/search?q=http://n.whatwg.org/work

I imagine it's meant to resolve to a human-readable spec, though, for
the same discoverability as RDFa.  It's probably an oversight; I've
asked Hixie to clarify.

> Philip, could you give us an update on what the WHATWG sees as the
> publishing process for Microdata vocabularies? For example, if Wikipedia
> wanted to start expressing royal bloodlines using a vocabulary specific
> to Wikipedia, how would they go about getting that vocabulary into the
> HTML5 Microdata specification?

We don't have to.  See the spec:

"The item type must be a type defined in an applicable specification."
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#item-type

"Applicable specification" links to

"When vendor-neutral extensions to this specification are needed,
either this specification can be updated accordingly, or an extension
specification can be written that overrides the requirements in this
specification. When someone applying this specification to their
activities decides that they will recognise the requirements of such
an extension specification, it becomes an applicable specification for
the purposes of conformance requirements in this specification."
http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#other-applicable-specifications

Anyone can write their own extension specification -- it becomes
"applicable" as soon as anyone decides to use it.

> It's like saying that programming in Python is more error prone than
> programming in PHP - it depends entirely on the skill of the developer,
> what you're doing, and many other factors that are out of the hands of
> language designers.

I think you'll find most MediaWiki developers strongly agree that PHP
is a terrible language and Python is way better, so maybe that was a
bad analogy.  :)

> Besides, the Wikipedia community has done a fantastic job of generating
> valid XHTML:

Well, rather, MediaWiki has done a good job there, despite all
attempts by the community.  ;)  Community inputs tag soup, MediaWiki
converts to valid XHTML.  But that's purely syntactic.  You can tell
from the extensive usage of tables that Wikipedians don't care about
standards or theoretical purity, they just try to get things to work
right.  That has to be taken into account.

On Sat, Jan 16, 2010 at 5:39 PM, Platonides <platoni...@gmail.com> wrote:
> Perhaps we shouldn't provide the full power of RDF or Microdata yet, and
> provide instead an extension able to handle a subset, using one or another.

What sort of user-visible syntax would you suggest?  We'd still have
to use either RDFa or microdata for the actual output, so it doesn't
save us much.

On Sat, Jan 16, 2010 at 7:09 PM, Happy-melon <happy-me...@live.com> wrote:
> I know sod all about either of them except what has been posted
> here, but I see that they're incredibly similar, but just different enough
> to be incompatible; and I see that they are both horribly difficult for the
> lay-editor to use.  By that I mean that the discussion between "oh this one
> only requires us to put in two new attributes instead of three" misses the
> elephant in the room: *both* formats require us to whitelist and start
> filling our wikitext with the HTML tag that the most iconic piece of
> wikimarkup, the double brackets, have kept hidden for nine years.

I don't think microdata is harder to use than HTML generally.  It's
sure a lot easier to use than wikitext template syntax (look at some
of those enwiki monstrosities).

> and b) even the most
> careful implementation is going to manifest itself in article wikitext along
> the lines of ""{{person|John Smith}}, born {{birthdate|12 June 1987}}, was a
> {{occupation|football player}} for {{organisation|Puddlemere United}}"".  Or
> something like that.

No, I don't think we'd do that at all.  We'd add microdata (or RDFa)
to things like license templates, and maybe infobox templates.  So
this would all be hidden behind templates people are already using
anyway.  The goal is immediately useful metadata like licenses -- we
want web crawlers to be able to automatically tell what licenses
images are under, say.  Abstract stuff like what you're marking up
shouldn't be provided with the HTML output, and should be input as
part of infoboxes (since people do that anyway).

> There seem to be two usecases for these systems.  First, marking up the
> 'stuff' that MediaWiki serves: images, copyright links, author links, etc.
> That requires MW to be able to get hold of the raw data for, for instance,
> an image license; and that's begging for things like new magic words to put
> on the image description page, not for enabling either format directly in
> wikitext.  The only reason to do *that*, is to support editors marking up
> *their own stuff*, and that's where we have problems.

I don't follow.  Why can't you just alter {{cc-by-2.0}} or whatever on
Commons so it outputs the right markup?  MediaWiki doesn't have to do
anything beyond allowing the markup to begin with.

> TLDR version: jumping on either bandwagon is neither necessary nor sensible,
> and we should avoid getting drawn into the issue.

I would agree, except that we have an immediate potential use: marking
up image licenses so image crawlers know how the images are licensed.
Google already hardcodes Wikipedia licenses, apparently, but we should
use standards-based machine-readable markup for the benefit of all the
other MediaWikis, and any Wikimedia wikis they haven't hardcoded, and
Commons too if they change a template name or something and break the
scraping, etc.  This is why Duesentrieb added the feature.  Unless we
all agree it's not worth getting into this for the sake of that
use-case, we do have to address the issue now.
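
For a sense of what a crawler would have to do with such markup, here
is a minimal Python sketch that pulls license URLs out of
Microdata-annotated HTML using the standard library's html.parser.  It
is an illustration only: the LicenseFinder class and find_licenses()
helper are made-up names, and the sketch ignores itemscope nesting and
itemtype, which a real Microdata consumer would have to honor.

```python
from html.parser import HTMLParser


class LicenseFinder(HTMLParser):
    """Collect href values of <a> elements whose itemprop includes 'license'."""

    def __init__(self) -> None:
        super().__init__()
        self.licenses: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # itemprop may hold several space-separated property names.
        props = (attributes.get("itemprop") or "").split()
        if tag == "a" and "license" in props and attributes.get("href"):
            self.licenses.append(attributes["href"])


def find_licenses(html: str) -> list[str]:
    finder = LicenseFinder()
    finder.feed(html)
    finder.close()
    return finder.licenses


sample = (
    '<p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by '
    '<span itemprop="author">Bob Smith</span> is licensed under a '
    '<a itemprop="license" '
    'href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative '
    'Commons Attribution-Share Alike 3.0 United States License</a>.</p>'
)
print(find_licenses(sample))
```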

On Sat, Jan 16, 2010 at 7:13 PM, Manu Sporny <mspo...@digitalbazaar.com> wrote:
> Just to be clear - I'm not trying to propose that wikipedia editors
> should start writing wiki markup interleaved with RDFa/Microdata. Quite
> the opposite - I think that allowing contributors to hand author RDFa or
> Microdata would be a very bad idea for Wikipedia. However, it seems like
> what you are saying is that interleaving HTML like this is not possible
> anyway - which is a good thing, IMHO.

HTML can be interleaved with wikitext.  This is needed because all
templates are written in wikitext, for instance.  Templates are just
chunks of wikitext that can get included in other pages, optionally
with some predefined parameters substituted with strings of yet more
wikitext.  So MediaWiki recursively substitutes all templates (along
with other things like conditional constructs) with their wikitext
output before evaluating the whole resulting mess as a single wikitext
string.
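
The substitution process described above can be sketched as a toy
Python expander.  This is not MediaWiki's parser: the TEMPLATES
dictionary, the "license-box" template, and the expand() function are
hypothetical, only positional parameters are handled, and conditional
constructs are left out entirely.  It makes repeated passes until
nothing is left to expand, which for this toy case has the same effect
as recursive substitution.

```python
import re

# Hypothetical template store: template name -> wikitext body.
# {{{1}}} stands for the first positional parameter.
TEMPLATES = {
    "license-box": '<div class="license">{{{1}}}</div>',
    "cc-by-2.0": "{{license-box|Licensed under CC BY 2.0}}",
}

# An innermost call: {{name}} or {{name|arg|...}} with no braces inside.
_CALL = re.compile(r"\{\{([^{}|]+)((?:\|[^{}]*)?)\}\}")
_PARAM = re.compile(r"\{\{\{(\d+)\}\}\}")


def _expand_call(match: re.Match) -> str:
    """Expand one template call by one level, filling in its parameters."""
    name = match.group(1).strip()
    args = match.group(2).split("|")[1:]
    body = TEMPLATES.get(name)
    if body is None:
        return match.group(0)  # unknown template: leave it as written

    def fill(param: re.Match) -> str:
        index = int(param.group(1)) - 1
        return args[index] if 0 <= index < len(args) else ""

    return _PARAM.sub(fill, body)


def expand(text: str, max_passes: int = 40) -> str:
    """Substitute templates repeatedly until the text stops changing."""
    for _ in range(max_passes):
        expanded = _CALL.sub(_expand_call, text)
        if expanded == text:
            return expanded
        text = expanded
    raise RecursionError("template expansion did not settle")


print(expand("Photo. {{cc-by-2.0}}"))
```

Only after this step would the resulting string, HTML included, be
handed on for evaluation as a whole.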

> Does anybody have a link to a previous discussion about how to get
> Wikipedia to output the same data that dbpedia.org is publishing?

As far as I can tell, dbpedia.org just has people manually sift
through Wikipedia templates and translate them to RDF.  Things like
infoboxes naturally lend themselves to users inputting key-value
pairs, which can easily be translated to RDF triples.  I don't think
we should use either microdata or RDFa for this kind of data-mining
use-case -- it would be way too much markup and not useful to
practically any viewers.  People who want to data-mine can use a
separate data stream, possibly RDF, possibly autogenerated by
MediaWiki.  Inline metadata is only ideal for things you want
browsers, search engines, etc. to see.
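
To illustrate why infobox key-value pairs map so naturally onto
triples, here is a minimal Python sketch.  It is not how dbpedia.org
or MediaWiki actually does it: infobox_to_triples() is a made-up
helper, the subject and predicates are plain strings rather than real
RDF URIs, and wiki links are simply flattened to their visible text.

```python
import re

# [[target|label]] or [[target]]; group 1 is the visible text.
_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def infobox_to_triples(subject: str, wikitext: str) -> list[tuple[str, str, str]]:
    """Turn '| key = value' lines into (subject, key, value) triples."""
    triples = []
    for line in wikitext.splitlines():
        line = line.strip()
        if not line.startswith("|") or "=" not in line:
            continue  # template name, closing braces, or other noise
        key, _, value = line[1:].partition("=")
        value = _LINK.sub(r"\1", value).strip()
        if value:
            triples.append((subject, key.strip(), value))
    return triples


infobox = """{{Infobox royalty
| name            = Caesar Augustus
| title           = [[Roman Emperor|Emperor]] of the [[Roman Empire]]
| place of burial = [[Mausoleum of Augustus]], Rome
|}}"""

for triple in infobox_to_triples("Augustus", infobox):
    print(triple)
```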

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
