Hi all,

I've been working on an API module/extension to extract metadata from
Commons image description pages and expose it through the API. I know
this is an area that various people have thought about from time to
time, so I thought it would be of interest to this list.

The specific goals I have:
*Should be usable for a lightbox-type feature ("MediaViewer") that
needs to display information like author and license. [1] (This is
the primary use case.)
*Should be generic where possible, so that better metadata access can
be had by all wikis, even if they don't follow Commons conventions.
For example, it should generically support Exif data from files where
possible/appropriate, overriding the Exif data when more reliable
sources of information are available.
*Should be compatible with a future Wikidata-on-Commons thing. [2]
**In particular, I want to read existing description page formatting,
not try to force people to use new parser functions or formatting
conventions, since they may become outdated in the near future when
Wikidata comes.
**Hopefully Wikidata would be able to hook into my system (while at the
same time providing its own native interface).
*Since descriptions on Commons are formatted data (wikilinks are
especially common), it needs to be able to output formatted data. I
think HTML is the easiest format to use, much easier than, say,
wikitext (however, this is perhaps debatable).


What I've come up with is a new API metadata property (currently
pending review in Gerrit) called extmetadata that has a hook
extensions can hook into. [3] [4] [5] Additionally, I developed an
extension for reading information from Commons description pages. [6]

It combines information from both the file's metadata and any
extensions. For example, if the Exif data has an author specified
("Artist" in Exif-speak), and the Commons description page also has
one specified, the description page takes precedence, under the
assumption that it's more reliable. The module outputs HTML, since
that's the type of data stored in the image description page (except
that it uses full URLs instead of local ones).
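
To make the precedence rule concrete, here is a rough sketch in Python
(not the actual PHP code in the patches). The source labels, the field
values, and the merge_metadata helper are all made up for illustration;
the only thing taken from the design is that a later, more reliable
source overrides an earlier one for the same field.

```python
def merge_metadata(sources):
    """Merge (source_name, fields) pairs in order.

    Later sources override earlier ones for the same field name, so the
    most reliable source should come last in the list.
    """
    merged = {}
    for source_name, fields in sources:
        for key, value in fields.items():
            # Record where each value came from, so a consumer can tell
            # Exif data apart from description-page data.
            merged[key] = {"value": value, "source": source_name}
    return merged


# Hypothetical inputs: Exif data from the file, and HTML scraped from
# the description page (note the full URL rather than a local one).
exif = {
    "Artist": "Camera Owner",
    "DateTimeOriginal": "2013:08:01 12:00:00",
}
description_page = {
    "Artist": '<a href="https://commons.wikimedia.org/wiki/User:Example">Example</a>',
}

result = merge_metadata([
    ("file-metadata", exif),                  # less reliable, goes first
    ("commons-desc-page", description_page),  # more reliable, goes last
])
```

Here result["Artist"] comes from the description page, while
result["DateTimeOriginal"] still falls back to the Exif value because
nothing overrode it.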

The downside is that, in order to effectively get metadata out of
Commons given current practices, one essentially has to screen-scrape
and do slightly ugly things (look ahead to a brighter tomorrow
with Wikidata!).

As an example, given a query like
api.php?action=query&prop=imageinfo&iiprop=extmetadata&titles=File:Schwedenfeuer_Detail_04.JPG&format=xmlfm&iiextmetadatalanguage=en
it would produce something like [7].
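
For anyone who wants to build that query from a script, a minimal
Python sketch follows. The parameter names are the ones in the query
above; the api.php base URL, the format=json default, and the
build_extmetadata_url helper are my assumptions, and since the patch is
still pending review the property may not be available on a given wiki.

```python
from urllib.parse import urlencode


def build_extmetadata_url(api_base, title, language="en", fmt="json"):
    """Build an imageinfo query URL that asks for extmetadata.

    api_base is the wiki's api.php endpoint (an assumption here); the
    query parameters mirror the example query in this mail.
    """
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "titles": title,
        "format": fmt,
        "iiextmetadatalanguage": language,
    }
    # urlencode percent-encodes reserved characters such as ":".
    return api_base + "?" + urlencode(params)


url = build_extmetadata_url(
    "https://commons.wikimedia.org/w/api.php",  # assumed endpoint
    "File:Schwedenfeuer_Detail_04.JPG",
)
```

The resulting URL can be fetched with any HTTP client; passing
fmt="xmlfm" instead gives the pretty-printed XML used in the example.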

So thoughts? /me eagerly awaits mail tearing my plans apart :)

[1] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer
[2] https://commons.wikimedia.org/wiki/Commons:Wikidata_for_media_info
[3] https://gerrit.wikimedia.org/r/#/c/81598/
[4] https://gerrit.wikimedia.org/r/#/c/78162/
[5] https://gerrit.wikimedia.org/r/#/c/78926/
[6] https://gerrit.wikimedia.org/r/#/c/80403/
[7] http://pastebin.com/yh5286iR

--
Bawolff

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
