Re: [Wikidata-l] supported and planned wikidata uris (was Re: Meta header for asserting that a web page is about a Wikidata subject)

2014-04-24 Thread Michael Smethurst


On 24/04/2014 11:10, "David Cuenca"  wrote:

>On Thu, Feb 27, 2014 at 11:49 AM, Michael Smethurst
> wrote:
>
>
>If I know everything needed to construct a wikipedia uri (language and uri
>key) is it possible to construct a uri that redirects to a wikidata Q
>style uri?
>
>Are there any convenience uris to map from wikipedia to wikidata?
>
>
>
>
>
>You can use Special:ItemByTitle like this:
>http://www.wikidata.org/wiki/Special:ItemByTitle/enwiki/BBC

Haha, fantastic

Thanks Micru

michael
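For anyone wanting to script this, the Special:ItemByTitle pattern above can be built mechanically from a site id and a page title. A minimal sketch (the helper name is mine, and title normalisation is simplified to the space-to-underscore rule):

```python
from urllib.parse import quote

def item_by_title_url(site: str, title: str) -> str:
    """Build a Special:ItemByTitle URL that redirects to the item's Q page.
    `site` is a Wikidata site id such as "enwiki"; spaces in titles become
    underscores, as in wiki page names."""
    return ("https://www.wikidata.org/wiki/Special:ItemByTitle/"
            f"{site}/{quote(title.replace(' ', '_'))}")

print(item_by_title_url("enwiki", "BBC"))
# https://www.wikidata.org/wiki/Special:ItemByTitle/enwiki/BBC
```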
>
>
>
>Cheers,
>Micru
>
>
>



-
http://www.bbc.co.uk
This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.
-

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] supported and planned wikidata uris (was Re: Meta header for asserting that a web page is about a Wikidata subject)

2014-02-27 Thread Michael Smethurst
hello

On 27/02/2014 08:44, "Markus Krötzsch" 
wrote:

>Hi,
>
>On 26/02/14 22:40, Michael Smethurst wrote:
>> Hello
>>
>> *Really* not meaning to jump down any http-range-14 rabbit holes but
>> wasn't there a plan for wikidata to have uris representing things and
>> pages about those things?
>>
>>  From conversations on this list I sketched a picture a while back of
>>all
>> the planned URIs:
>> http://smethur.st/wp-uploads/2012/07/46159634-wikidata.png
>>
>>
>> Where
>> http://wikidata.org/id/Qetc
>> Was the "thing" uri (which you could point a foaf:primaryTopic at)
>
>As Denny said in reply to another message, the preferred URI for this is
>
>http://www.wikidata.org/entity/Qetc
>
>This is also the form of URIs used within Wikidata data for certain
>things (e.g., coordinates that refer to earth use the URI
>"http://www.wikidata.org/entity/Q2" to do so, even in JSON).

Ok, makes sense

So the correct "sem web way" would be:
<link href="https://www.wikidata.org/entity/Qetc" property="http://xmlns.com/foaf/0.1/primaryTopic" />

And the schema.org way would be:
<link href="https://wikidata.org/wiki/Qetc" property="http://schema.org/sameAs" />


>
> > and
>> http://wikidata.org/wiki/Qetc
>>
>> Was the document uri
>
>Yes. However, for metadata it is usually preferred to use the entity
>URI, since the document http://wikidata.org/wiki/Qetc is just an
>automatic UI rendering of the data, and as such relatively
>uninteresting. One will eventually get (using content negotiation) all
>data in RDF from http://www.wikidata.org/entity/Qetc (JSON should
>already work, and html works of course, when opening the entity URI in
>normal browsers). The only reason for using the wiki URI directly would
>be if one uses a property that requires a document as its value, but in
>this case one should probably better use another property.

Does that conflate the can't send that / 303 part with the content
negotiation part?

Guessing it follows the dbpedia pattern which isn't always nice to work
with

Personally would prefer /entity/ to 303 to a generic document uri and do
the conneg part from there
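The flow preferred here can be sketched as two separable steps: a 303 from the entity URI to one generic document URI, then content negotiation at that document URI answering 200 with a Content-Location. This is the author's preference, not Wikidata's actual behaviour; paths and the helper name are illustrative:

```python
def resolve_entity(path: str, accept: str):
    """Sketch of the two-step pattern: the thing URI 303s to a single
    generic document URI, and conneg happens only there, returning 200
    with a Content-Location for the chosen serialization."""
    qid = path.rsplit("/", 1)[-1].split(".")[0]
    if path.startswith("/entity/"):
        # Step 1: "can't send you that" -- thing URI to generic document URI
        return 303, {"Location": f"/wiki/Special:EntityData/{qid}"}
    # Step 2: conneg on the generic document URI, no further redirect
    ext = ("ttl" if "text/turtle" in accept
           else "json" if "application/json" in accept
           else "html")
    return 200, {"Content-Location": f"/wiki/Special:EntityData/{qid}.{ext}"}

print(resolve_entity("/entity/Q2", "text/turtle"))
```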

===

What about the second part of the question? Is there a full list of
supported uri patterns for wikidata?

If I know everything needed to construct a wikipedia uri (language and uri
key) is it possible to construct a uri that redirects to a wikidata Q
style uri?

Are there any convenience uris to map from wikipedia to wikidata?

Thanks
michael
>
>Best regards,
>
>Markus
>
>
>>
>> Mainly asking not for the wikipedia > wikidata relationships but
>>wondering
>> if there's a more up to date picture of supported wikidata uri patterns
>> and redirects?
>>
>> Recently I was trying to find a way to programmatically get wikidata
>>uris
>> from wikipedia uris and tried various combinations of:
>> http://wikidata.org/title/enwiki:Berlin
>> http://en.wikidata.org/item/Berlin
>> http://en.wikidata.org/title/Berlin
>>
>>
>> (all mentioned on the list / wiki) but all of them return a 404
>>
>> Is there a way to do this?
>>
>> Michael
>>
>>
>>
>>
>> On 26/02/2014 19:09, "Dan Brickley"  wrote:
>>
>>> On 26 February 2014 10:45, Joonas Suominen
>>>
>>> wrote:
>>>> How about using RDFa and foaf:primaryTopic like in this example
>>>> https://en.wikipedia.org/wiki/RDFa#XHTML.2BRDFa_1.0_example
>>>>
>>>> 2014-02-26 20:18 GMT+02:00 Paul Houle :
>>>>
>>>>> Isn't there some way to do this with schema.org?
>>>
>>> The FOAF options were designed for relations between entities and
>>> documents -
>>>
>>> foaf:primaryTopic relates a Document to a thing that the doc is
>>> primarily about (i.e. assumes entity IDs as value, pedantically).
>>>
>>> the inverse, foaf:isPrimaryTopicOf, was designed to allow an entity
>>> description in a random page to anchor itself against well known
>>> pages. In particular we had Wikipedia in mind.
>>>
>>> http://xmlns.com/foaf/spec/#term_primaryTopic
>>> http://xmlns.com/foaf/spec/#term_isPrimaryTopicOf
>>>
>>> (Both of these share a classic Semantic Web pickiness about
>>> distinguishing things from pages about those things).
>>>
>>> Much more recently at schema.org we've added a new
>>> property/relationship called http://schema.org/sameAs
>>>
>>> It relates an entity to a reference page (e.g. wikipedia) that can be
>>> used as a kind of proxy identifier for

[Wikidata-l] supported and planned wikidata uris (was Re: Meta header for asserting that a web page is about a Wikidata subject)

2014-02-26 Thread Michael Smethurst
Hello

*Really* not meaning to jump down any http-range-14 rabbit holes but
wasn't there a plan for wikidata to have uris representing things and
pages about those things?

From conversations on this list I sketched a picture a while back of all
the planned URIs:
http://smethur.st/wp-uploads/2012/07/46159634-wikidata.png


Where
http://wikidata.org/id/Qetc
Was the "thing" uri (which you could point a foaf:primaryTopic at) and
http://wikidata.org/wiki/Qetc

Was the document uri

Mainly asking not for the wikipedia > wikidata relationships but wondering
if there's a more up to date picture of supported wikidata uri patterns
and redirects?

Recently I was trying to find a way to programmatically get wikidata uris
from wikipedia uris and tried various combinations of:
http://wikidata.org/title/enwiki:Berlin
http://en.wikidata.org/item/Berlin
http://en.wikidata.org/title/Berlin


(all mentioned on the list / wiki) but all of them return a 404

Is there a way to do this?

Michael




On 26/02/2014 19:09, "Dan Brickley"  wrote:

>On 26 February 2014 10:45, Joonas Suominen 
>wrote:
>> How about using RDFa and foaf:primaryTopic like in this example
>> https://en.wikipedia.org/wiki/RDFa#XHTML.2BRDFa_1.0_example
>>
>> 2014-02-26 20:18 GMT+02:00 Paul Houle :
>>
>>> Isn't there some way to do this with schema.org?
>
>The FOAF options were designed for relations between entities and
>documents -
>
>foaf:primaryTopic relates a Document to a thing that the doc is
>primarily about (i.e. assumes entity IDs as value, pedantically).
>
>the inverse, foaf:isPrimaryTopicOf, was designed to allow an entity
>description in a random page to anchor itself against well known
>pages. In particular we had Wikipedia in mind.
>
>http://xmlns.com/foaf/spec/#term_primaryTopic
>http://xmlns.com/foaf/spec/#term_isPrimaryTopicOf
>
>(Both of these share a classic Semantic Web pickiness about
>distinguishing things from pages about those things).
>
>Much more recently at schema.org we've added a new
>property/relationship called http://schema.org/sameAs
>
>It relates an entity to a reference page (e.g. wikipedia) that can be
>used as a kind of proxy identifier for the real world thing that it
>describes. Not to be confused with owl:sameAs which is for saying
>"here are two ways of identifying the exact same real world entity".
>
>None of these are a perfect fit for a relationship between a random
>Web page and a reference page. But maybe close enough?
>
>Both FOAF and schema.org are essentially dictionaries of
>hopefully-useful terms, so you can use them in HTML head, or body,
>according to taste, policy, tooling etc. And you can choose a syntax
>(microdata, rdfa, json-ld etc.).
>
>I'd recommend using the new schema.org 'sameAs', e.g. in RDFa Lite:
>
><link href="https://en.wikipedia.org/wiki/Buckingham_Palace" property="http://schema.org/sameAs" />
>
>This technically says "the thing we're describing in the current
>element is Buckingham_Palace". If you want to be more explicit and say
>"this Web page is about a real world Place and that place is
>Buckingham_Palace", you can do this too with a bit more nesting; the
>HTML body might be a better place for it.
>
>Dan
>






Re: [Wikidata-l] Wikidata RDF Issues

2013-10-22 Thread Michael Smethurst


On 21/10/2013 21:52, "Daniel Kinzler"  wrote:

>Am 21.10.2013 16:48, schrieb Kingsley Idehen:
>> Can someone not change 302 to 303 re: RewriteRule ^/entity/(.*)$
>> https://www.wikidata.org/wiki/Special:EntityData/$1 [R=302,QSA] ?
>
>
>The thing is that we intended this to be an internal apache rewrite, not
>a HTTP
>redirect at all. Because Special:EntityData itself implements the content
>negotiation that triggers a 303 when appropriate.
>
>So, currently we get a 302 from /entity/Q$1 to
>/wiki/Special:EntityData/$1 (the
>generic document URI), which then applies content negotiation and sends a
>303
>pointing to e.g. /wiki/Special:EntityData/$1.ttl (the URL of a specific
>serialization, e.g. in turtle).

Wondering why the 2nd step (conneg) returns a 303. Shouldn't it just be a
200 with a content location?
michael
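To make the cost of the chain concrete: with the current setup a client makes two redirect round trips before it sees any data, where an internal rewrite (or a 200 with Content-Location) would need fewer. A toy sketch, with URLs from the thread and a made-up response map standing in for real HTTP:

```python
def follow(url, responses):
    """Walk a redirect chain and return the URLs visited. `responses` is a
    hypothetical map from URL to ("redirect", next_url) or ("ok", body);
    it stands in for real HTTP responses."""
    hops = [url]
    while responses[url][0] == "redirect":
        url = responses[url][1]
        hops.append(url)
    return hops

# The chain described above: a 302 hop, then a 303 hop, then the data
current = {
    "/entity/Q2": ("redirect", "/wiki/Special:EntityData/Q2"),
    "/wiki/Special:EntityData/Q2": ("redirect", "/wiki/Special:EntityData/Q2.ttl"),
    "/wiki/Special:EntityData/Q2.ttl": ("ok", "turtle data"),
}
print(follow("/entity/Q2", current))
```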
>
>What I want is to remove the initial 302 completely using an internal
>rewrite,
>not replace it with another 303 - since I don't think that's semantically
>correct. This did not work when tried, for reasons unknown to me. Someone
>suggested that the wrong options were set for the rewrite rule; who
>knows.
>
>Kingsley, do you think having two 303s (from /entity/Q$1 to
>/wiki/Special:EntityData/$1 and another one to
>wiki/Special:EntityData/$1.xxx)
>would be appropriate or at least better than what we have now?
>
>-- daniel
>
>___
>Wikidata-l mailing list
>Wikidata-l@lists.wikimedia.org
>https://lists.wikimedia.org/mailman/listinfo/wikidata-l






Re: [Wikidata-l] Data values

2012-12-20 Thread Michael Smethurst
Not quite on topic but on the subject of uncertainty around dates I've worked 
with a couple of data sets where birth and death dates were unknown but 
activity periods [1] were known. These have either had a separate flag called 
is_flourished (or similar) used to modify born / died or separate flourished 
dates from birth / death

Some flourished dates here:
http://en.wikipedia.org/wiki/List_of_British_architects
http://en.wikipedia.org/wiki/Template:Clan_Maclean_Chiefs


[1] http://en.wikipedia.org/wiki/Floruit
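The "separate flourished dates" option above can be sketched as a small data model (field names are illustrative, not from any actual schema): floruit periods live alongside, not inside, birth/death dates.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonDates:
    """Sketch: keep floruit (activity period) dates as separate fields
    rather than a boolean flag modifying born/died."""
    born: Optional[int] = None
    died: Optional[int] = None
    flourished_from: Optional[int] = None
    flourished_to: Optional[int] = None

    def display(self) -> str:
        if self.born or self.died:
            return f"{self.born or '?'}-{self.died or '?'}"
        if self.flourished_from:
            return f"fl. {self.flourished_from}-{self.flourished_to or '?'}"
        return "dates unknown"

print(PersonDates(flourished_from=1703, flourished_to=1721).display())
```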

From: wikidata-l-boun...@lists.wikimedia.org 
[wikidata-l-boun...@lists.wikimedia.org] on behalf of Avenue 
[avenu...@gmail.com]
Sent: 20 December 2012 19:59
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata-l] Data values


Thanks, the prototype helps make some of this more concrete.

I am increasingly wondering if "uncertainty" will be overloaded here. People 
seem to want to use it for various types of measurement uncertainty (e.g. the 
standard error), ranges with no defined central value, and distributional 
summaries (e.g. max and min), as well as for the precision with which a value 
is entered (as in the  "auto-certainty" value in the prototype). These are all 
quite different beasts, and conflating them will probably lead to problems - 
particularly for precision versus the rest. Which do we choose, if both apply? 
How will we know which is meant? Maybe marking "auto-certainty" values somehow 
would mitigate the latter problem, at least.

Avenue

On Thu, Dec 20, 2012 at 4:10 PM, Denny Vrandečić wrote:
I am still trying to catch up with the whole discussion and to distill the 
results, both here and on the wiki.

In the meanwhile, I have tried to create a prototype of how a complex model can 
still be entered in a simple fashion. A simple demo can be found here:



The prototype is not i18n.

The user has to enter only the value, in a hopefully intuitive way (try it 
out), and the full interpretation is displayed here (that, alas, is not 
intuitive, admittedly).

Cheers,
Denny





2012/12/20 <jmccl...@hypergrove.com>:


(Proposal 3, modified)
* value (xsd:double or xsd:decimal)

* unit (a wikidata item)

* totalDigits (xsd:smallint)
* fractionDigits (xsd:smallint)
* originalUnit (a wikidata item)
* originalUnitPrefix (a wikidata item)

JMc: I rearranged the list a bit and suggested simpler naming

JMc: Is not originalUnitPrefix directly derived from originalUnit?

JMc: May be more efficient to store, not reconstruct, the original value. May
even be better to store the original value somewhere else entirely, earlier in
the process, e.g. within the context that you indicate would be worthwhile to
capture, because I wouldn't expect a lot of retrievals, but you anticipate usage
patterns certainly better than I.



How about just:

Datatype: .number  (Proposal 4)

-
  :value (xsd:double or xsd:decimal)

  :unit (a wikidata item)
  :totalDigits (xsd:smallint)
  :fractionDigits (xsd:smallint)

  :original (a wikidata item that is a number object)


On 20.12.2012 03:08, Gregor Hagedorn wrote:

On 20 December 2012 02:20, <jmccl...@hypergrove.com> wrote:

For me the question is how to name the precision information. Do not the XSD 
facets "totalDigits" and "fractionDigits" work well enough? I mean

Yes, that would be one way of modeling it. And I agree with you that,
although the xsd attributes originally are devised for datatypes,
there is nothing wrong with re-using it for quantities and
measurements.

So one way of expressing a measurement with significant digits is:
(Proposal 1)
* normalizedValue
* totalDigits
* fractionDigits
* originalUnit
* normalizedUnit

To recover the original information (e.g. that the original value was
in feet with a given number of significant digits) the software must
convert normalizedUnit to originalUnit, scale to totalDigits with
fractionDigits, calculate the remaining powers of ten, and use some
information that must be stored together with each unit whether this
then should be expressed using an SI unit prefix (the Exa, Tera, Giga,
Mega, kilo, hekto, deka, centi, etc.). Some units use them, others
not, and some units use only some. Hektoliter is common, hektometer
would be very odd. This is slightly complicated by the fact that for
some units prefix usage in lay topics differs from scientific use.

If all numbers were expressed ONLY as total digits with fraction
digits and unit-prefix, i.e. no power-of-ten exponential, the above
would be sufficiently complete. However, without additional
information it does not allow recovering the entry:

100,230 * 10^3 tons
(value 1.0023e8, 6 total, 3 fractional digits, original unit tons,
normalized unit gram)
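The recovery step in Proposal 1 can be sketched as follows. Note this recovers a mantissa with the recorded digit counts from the normalized value only; it cannot recover the original unit/prefix split (Gregor's point above), so it is a sketch for positive values with no unit handling:

```python
from math import floor, log10

def format_significant(value: float, total: int, fractional: int) -> str:
    """Render a normalized value with its recorded significant digits:
    `total` digits overall, `fractional` of them after the decimal point,
    with the remaining magnitude expressed as a power of ten."""
    int_digits = total - fractional                # digits before the point
    exp = floor(log10(value)) - (int_digits - 1)   # remaining power of ten
    mantissa = value / 10 ** exp
    return f"{mantissa:.{fractional}f} * 10^{exp}"

# The thread's example value, 1.0023e8, with 6 total / 3 fractional digits
print(format_significant(1.0023e8, 6, 3))
# 100.230 * 10^6
```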

I had therefore made (on the wiki) the proposal to express it as:

(Proposal 2)
* normalizedValue
* significantDigits (= and I am happy with totalD

Re: [Wikidata-l] Canonical URL for Wikidata pages?

2012-12-04 Thread Michael Smethurst
Hello

I've *finally* updated my wikidata URI pattern picture based on this and an 
earlier conversation from back in August [1]:

http://smethur.st/wikidata

(you need to click the image before it's readable)

Hoping it looks a little more correct. Certainly makes more sense in my head.

My only concern is that (like DBpedia) it seems to conflate the 303 (can't send 
you that) step with the content negotiation step.

So if:
http://wikidata.org/id/Q{id}
is the entity / non-information resource URI

and:
http://wikidata.org/wiki/Q{id}
http://wikidata.org/data/Q{id}.{format}.{lang} (or similar)
are information resource representation URIs

there's no generic information resource URI in the scheme.

In BBC-land we try to never expose the information resource representation URIs 
except as location headers. So

bbc.co.uk/programmes/:programme#programme
is the NIR

bbc.co.uk/programmes/:programme
is the generic IR which connegs to:

bbc.co.uk/programmes/:programme.html < desktop html
bbc.co.uk/programmes/:programme.mp < mobile html
bbc.co.uk/programmes/:programme.json < json
bbc.co.uk/programmes/:programme.xml < xml
bbc.co.uk/programmes/:programme.rdf < rdf
etc

and the .html, .mp, .json, .xml, .rdf are never exposed except as location 
headers for conneg on the generic IR and the #programme bit is never used 
except when we want to make RDF(a) statements about the NIR

Wondering if the 303 / conneg conflation will make it more difficult to 
understand what's going on, more difficult to work with and more difficult to 
host if you pick up a 303 for every link (as in DBpedia)
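The BBC pattern above (one generic IR, representation URIs exposed only via Content-Location) can be sketched like this. The programme id and media-type mapping are illustrative, not the real /programmes implementation:

```python
def negotiate(generic_ir: str, accept: str):
    """Sketch: only the generic information-resource URI appears in links;
    conneg picks a representation and answers 200 with a Content-Location,
    so the .html/.json/.rdf URIs are never linked to directly."""
    suffix = {
        "application/json": "json",
        "application/rdf+xml": "rdf",
        "text/html": "html",
    }.get(accept.split(",")[0].strip(), "html")
    return 200, {"Content-Location": f"{generic_ir}.{suffix}"}

print(negotiate("/programmes/p0000000", "application/json"))
```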

cheers
michael


[1] http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00858.html





Re: [Wikidata-l] Wikidata links

2012-07-26 Thread Michael Smethurst
Very delayed reply but think I'm still confused on this. Made a picture to
clear my mind but not sure it works:
http://smethur.st/wikidata

The bit I think I get:
If I request
http://en.wikidata.org/wiki/Berlin
Or
http://en.wikidata.org/title/Berlin
I get a 301? to:
http://wikidata.org/title/en:Berlin
The html wiki page

But not sure I understand the machine readable part [1]

Bullet point 1 says
http://wikidata.org/id/Q{id}
Resolves to the appropriate url depending on the request header

Does resolve mean a redirect? Is that a 303?

Or is there no redirect and the "thing" uri returns content?

What's the "appropriate url"?
http://wikidata.org/data/Q{id}
Or
http://wikidata.org/data/Q{id}?format={format}&language={language}
?

Bullet point 2 says
http://en.wikidata.org/item/Berlin
Also resolves to the appropriate url. Is that a redirect? What's the
appropriate url?

Is there content negotiation happening from
http://wikidata.org/wiki/Q{id}

Or just from
http://wikidata.org/id/Q{id}

What happens if I request
http://wikidata.org/id/Q{id}
And accept only html?

Is there content negotiation from
http://wikidata.org/data/Q{id}
Or do I have to use parameters to get different representations?

Is there a better picture?

Sorry to be thick
Michael

[1] 
https://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme#Machine-readable_access


On 06/07/2012 18:20, "Gregor Hagedorn"  wrote:

> Thanks Denny, I largely see your points. The distinction between
> convenience = webservice to redirect to canonical URL and canonical
> URL could perhaps be made clearer in the note. I read it as parallel
> URIs rather than as a redirecting service. To me the word
> "convenience" has a different implication, but this may be entirely my
> fault, I am not a native speaker either. I also agree on the choice of
> language prefixes, confusing as it may be, I should have known. The
> data plus wikidata is still confusing, but I guess you cannot avoid
> that one?
> 
> 
> About the Q in front of identifiers: At the moment I see the item
> numbers being used in rdf:resource/about, but I understand that you
> may need them as element names? My understanding was that properties
> will be prefixed by Property: anyways.
> 
> In any event: I find the argument that a rare letter like Q is good
> branding not very convincing. I would suggest then a more mnemonic
> choice, like WD2348972 or W2348972 instead. I believe the Q as prefix
> used in all canonical inbound links will puzzle many people, and the
> explanation will end up in the FAQ.
> 
> thanks again!
> 
> Gregor
> 






Re: [Wikidata-l] DBpedia usage in the bbc - selected highlights

2012-07-05 Thread Michael Smethurst



On 05/07/2012 10:56, "Michael Hopwood"  wrote:

> Hello Michael, Nicholas et list,

Hi Michael

> 
> I hope you don't mind me jumping in here with a few comments on selected
> highlights of this thread.
> 
 Taking /music as an example...
> 
> I wonder if you have looked at book data? I am working on issues to do with
> linked (open?) book data and it would be useful to compare notes.

Not much. We've played around with ideas about linking programmes to books
(readings, reviews, dramatisations etc) and played with some book data.
Mostly it seems to make music metadata look sane and tidy :-/

> 
 wikipedia tends to conflate... composition with recording with release...
> 
> On the other hand, data does exist that separates these (and more!) entity
> types out very clearly, and it's potentially highly *linked* but it's unlikely
> to be *open*. See:
> 
> http://www.ddex.net/ddex-present - ddex descriptive data schemas, but also
> note the links there to IDs for
> 
> -names (ISNI)
> -compositions (ISWC)
> -recordings (ISRC)
> -releases (GRid)
> 
> These are all industry-standard IDs, and thus pretty stable. Maybe a starting
> point?

We have some industry identifiers internally and MusicBrainz has some
coverage of ISRCs. But they're all really just identity authorities. They
don't really deal with the links between entities which is what things like
MusicBrainz give us
> 
 ...domains where there's no established (open) authority (eg the equivalent
 of musicbrainz for films)...
> 
> EIDR? http://eidr.org/ - " EIDR is operated on a non-profit cost-recovery
> basis..." but maybe you get the stability and granularity you pay for? Plus;
> "... EIDR is founded on the principle of open participation and welcomes all
> ecosystem players (commercial and non-profit) to join the Registry as
> registrant, lookup user or even a promoter. The Registry is intended to
> provide a foundational namespace for A/V objects that can be leveraged by
> participant in the eco-system to further their own business needs and
> offerings." - http://eidr.org/resources/

Same story with eidr really. They're an identity authority rather than a
metadata service. They take just enough metadata to be able to effectively
spot duplicates. Which is handy but isn't linked data

(personal opinion is) one day all identifier schemes become http uris
because identifiers which link and can be dereferenced are more useful

Cheers
michael
> 
> Cheers,
> 
> Michael






Re: [Wikidata-l] DBpedia usage in the bbc

2012-07-04 Thread Michael Smethurst



On 04/07/2012 10:48, "Denny Vrandečić"  wrote:

> Hello Michael,
> 
> thank you for your input, this is extremely valuable.
> 
> In general I expect that Wikidata will serve your needs better than an
> extraction from Wikipedia could. First, yes, we will have more stable
> identifiers. Second, it should be better at identifying items of
> interest. Some of the reasons why several meanings are conflated into
> one article or spread over several articles in Wikipedia is that it
> simply makes sense for a text encyclopedia. I don't see a reason for
> Wikidata doing the same.
> 
> I do not expect Wikidata to solve all problems. In some glorious
> future, Wikidata will have a community. This community will decide on
> criteria for inclusion, both with regards to the coverage of items and
> with regards to what they are saying about them. The community will
> decide on the kind of sources they accept. Etc.
> 
> (Actually, "decide" is too nice a word for the process I expect will unfold...
> )
> 
> We will keep the problems you mentioned in mind, and I fully think
> that we will improve on every single one of them.

Look forward to seeing it unfold :-)
> 
> 2012/7/3 Michael Smethurst :
> 
>> So I think we'd be interested in wikidata for 2 (maybe 3) reasons:
>> 1. as a source of data for domains where there's no established (open)
>> authority (eg the equivalent of musicbrainz for films)
>> 2. as a better, more stable source of identifiers to triangulate to other
>> data sources
> 
> Yes, I expect that both use cases will be covered by Wikidata.
> 
>> 3. Possibly as a place to contribute some of our data (eg we're
>> donating our classical music data to musicbrainz; there may be data we have
>> that would be useful to wikidata)
> 
> It will be up to the community to accept data donations -- the
> development team does not speak for the community.

Yes, that goes for musicbrainz too. We can offer data but it's up to the
community whether or not they accept it

> Personally I would
> be thrilled to see such donations happen. See also:
> 
> <http://meta.wikimedia.org/wiki/Wikidata/FAQ#I_have_a_lot_of_data_to_contribute._How_can_I_do_that.3F>
> 
>> Have glanced quickly at the proposed wikidata uri scheme
>> (http://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme#Proposal_for_Wikid
>> ata) and
>> 
>> http://{site}.wikidata.org/item/{Title} is a semi-persistent convenience URI
>> for the item about the article Title on the selected site
>> Semi-persistent refers to the fact that Wikipedia titles can change over
>> time, although this happens rarely
>> 
>> Not sure on the definition of infrequently but I know it's caused us
>> problems.
> 
> Fully agree. But they make for nice looking URIs.

Aesthetic concerns about uris tend to make me shiver :-)

> The canonical URI
> though is the ID-based one, and these are stable. The pretty ones are
> for convenience only. I will take a look at the note to see if this
> needs to be made more explicit.

Think it is explicit. Just that there's so many flavours of URI knocking
about it feels a bit confusing. The separation of the human readable and the
machine readable feels like it's following the dbpedia design pattern and
conflating the NIR > IR step with the content negotiation which feels (to
me) like a mistake.

Have talked about this in the past on the LOD list so to save typing:
http://lists.w3.org/Archives/Public/public-lod/2012Mar/0337.html

Not sure putting /data in a URI is ever a good idea. Shouldn't whether you
want data or not be decided by your accept headers? Same for ?format=json
etc.

For reference we use hash uris for things but only reference those in rdf
and never link to them. One information resource uri gets exposed in links /
the browser bar and does content negotiation for format (and eventually
language), and the response comes with a Content-Location header of the IR URI
plus the format extension.



> 
>> Wondering if the id in http://wikidata.org/id/Q{id} is the wikipedia row ID
>> (as used by dbpedialite)? Also wondering why there's a different set of URIs
>> for machine-readable access rather than just using content negotiation?
> 
> No it is not. There is no such thing as the "wikipedia row ID", what
> you mean is the "page ID on the English Wikipedia".

Ah, ok. Think someone once said that was the id of the underlying database
row of the page record. Looking at dbpedialite it seems it does only support
en.wikipedia

> As there are
> plenty of items that have articles only in Wikipedia other than
> English, a reliance on the English Page ID would be problematic. We
> i

Re: [Wikidata-l] DBpedia usage in the bbc

2012-07-04 Thread Michael Smethurst



On 03/07/2012 19:19, "Tom Morris"  wrote:

> On Tue, Jul 3, 2012 at 9:32 AM, Michael Smethurst
>  wrote:
> 
> I'm really looking forward to Wikidata, but it sounds like you might
> not be familiar with Freebase which already provides solutions to some
> of your problems today.

Hi Tom

Short answer is we are familiar with Freebase and we have talked about using
it but not done for a variety of reasons. Mainly because other data sets we
use (like MusicBrainz) tend to link to Wikipedia and not Freebase (except
through Wikipedia)

Should have probably split my list of problems into 2 parts:
- using the data from dbpedia
- using identifiers from dbpedia

Freebase would solve some of the data normalisation problems but as I said,
mainly we use dbpedia as a source of identifiers and for identifier
triangulation. We use a standard dump of Musicbrainz (rather than the
MusicBrainz data in Freebase) so to triangulate to Freebase we'd need to go
through Dbpedia and rely on their identifiers to be stable

Cheers
michael 




___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


[Wikidata-l] DBpedia usage in the bbc

2012-07-03 Thread Michael Smethurst
Hello

A few notes on the BBC's use of DBpedia which Dan thought might be of
interest to this list:

Not sure how familiar you are with bbc web stuff so a brief introduction


We have a large and somewhat sprawling website with 2 main sections: news
article related stuff (including sports) and programme related stuff (tv and
radio). In between these sections are various other domain specific bits
(http://www.bbc.co.uk/music, http://www.bbc.co.uk/food,
http://www.bbc.co.uk/nature etc)

In the main we have actual content / data for news articles and programmes.
Most of the other bits of co.uk are really just different ways of cutting
this content / new aggregations. Because we don't have data for these
domains we borrow from elsewhere (mostly from the LOD cloud). So /music is
based on a backbone of musicbrainz data, /nature is based on numerous data
sources (open and not so open) all tied together with dbpedia identifiers...

In the main we don't really use dbpedia as a data source but rather as a
source of identifiers to triangulate with other data sources

So for example, we have 2 tools for "tagging" programmes with dbpedia
identifiers. Short clips are tagged with one tool using dbpedia information
resource uris, full episodes are tagged with another tool using dbpedia
non-information resource uris (< don't ask)

Taking /music as an example: because it's based on musicbrainz and because
musicbrainz includes wikipedia uris for artists we can easily derive dbpedia
uris (of whatever flavour) and query the programme systems for programmes
tagged with that artist's dbpedia uri
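A minimal sketch of the derivation described above, going from a Wikipedia article URL (as found in MusicBrainz) to the corresponding DBpedia non-information-resource URI. The function name is hypothetical, and DBpedia's real encoding rules for special characters in titles are more involved than this:

```python
from urllib.parse import urlparse

def dbpedia_uri_from_wikipedia(wikipedia_url: str) -> str:
    """Derive a DBpedia resource URI from an English Wikipedia
    article URL by reusing the article's title slug."""
    path = urlparse(wikipedia_url).path        # e.g. "/wiki/Stoat"
    title = path.split("/wiki/", 1)[1]         # slug, already underscored
    return "http://dbpedia.org/resource/" + title

print(dbpedia_uri_from_wikipedia("http://en.wikipedia.org/wiki/Stoat"))
# -> http://dbpedia.org/resource/Stoat
```

Note this inherits the fragility discussed in the problems list below: the derived URI is only as stable as the Wikipedia title it is built from.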



=== some problems we've found when using dbpedia ===

1. it's not really intended for data extraction. The semantics of
extraction depend on the infobox data and this isn't always applied
correctly. So http://en.wikipedia.org/wiki/Fox_News_Channel and
http://en.wikipedia.org/wiki/Fox_News_Channel_controversies share the same
main infobox, meaning dbpedia sees them both as tv channels

2. wikipedia tends to conflate many objects into a single item / page. Eg
http://en.wikipedia.org/wiki/Penny_Lane has composer details, duration
details and release information conflating composition with recording with
release

3. the data extraction is a bit flakey in parts. Mainly because it's been
done by a small team and it covers so many different domains.

4. wikipedia doesn't do redirects properly. So
http://en.wikipedia.org/wiki/Spring_watch and
http://en.wikipedia.org/wiki/Autumn_watch are based on the same data /
return the same content and are flagged as a redirect internally but they
don't actually return a 30x redirect. This makes it confusing for editorial
staff to know which uri to "tag" with
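A minimal sketch of the check a tagging tool might run to tell a real 30x redirect from two titles that merely serve the same content (the helper names here are hypothetical, not from any BBC tool; `http.client` is used because it never auto-follows redirects):

```python
import http.client
from urllib.parse import urlparse

def fetch_status(url: str) -> int:
    """HEAD the URL and return the raw HTTP status code,
    without following any redirect."""
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    conn.request("HEAD", parts.path or "/")
    status = conn.getresponse().status
    conn.close()
    return status

def is_real_redirect(status: int) -> bool:
    """True only for an actual 30x response; a 'soft' redirect that
    serves the same page under both titles comes back as 200."""
    return 300 <= status < 400
```

With this, `is_real_redirect(fetch_status(url))` distinguishes the two cases: a properly redirected title returns 301/302, while the Spring_watch/Autumn_watch situation described above would return 200 for both URLs.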

5. wikipedia uris are derived from the article title. If the article title
changes the uri changes. Dbpedia uris are derived from wikipedia uris so
they also change when wikipedia uris / titles change. This has caused us no
end of upsets. An example: bbc.co.uk/nature uses wiki|dbpedia uri slugs. So
http://en.wikipedia.org/wiki/Stoat on wikipedia is
http://www.bbc.co.uk/nature/life/Stoat on bbc.co.uk
Apparently people in the UK call stoats stoats and people in the US call
them ermine (or the other way round), which led to an edit war on wikipedia
which caused the dbpedia uri to flip repeatedly and our aggregations to
break. We've had similar problems with music artists (can't quite remember
the details but seem to remember some arguments about how the "and" should
appear in Florence and the Machine:
http://en.wikipedia.org/wiki/Florence_and_the_Machine)

6. Titles do change often enough to cause us problems, particularly names
for people.
Nic (cced) has done some work on dbpedia lite (http://dbpedialite.org/)
which aims to provide stable identifiers for dbpedia concepts based on (I
think) wikipedia table row identifiers (which wikimedia do claim are
guaranteed)
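The stable identifier dbpedialite leans on is the MediaWiki page ID, which survives article renames. A sketch of resolving a title to its page ID via the standard MediaWiki `action=query` API (the sample response below is illustrative in shape; the page ID value is made up):

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def page_id_query_url(title: str) -> str:
    """Build a MediaWiki API query for a title; the pageid it returns
    is stable across renames, unlike the title-based URI slug."""
    return API + "?" + urlencode(
        {"action": "query", "titles": title, "format": "json"})

def extract_page_id(response_json: str) -> int:
    """Pull the pageid out of an action=query response body."""
    pages = json.loads(response_json)["query"]["pages"]
    (page,) = pages.values()  # one title requested, one page back
    return page["pageid"]

# Illustrative shape of the API's answer for a single title:
sample = ('{"query": {"pages": {"166844": '
          '{"pageid": 166844, "ns": 0, "title": "Stoat"}}}}')
print(extract_page_id(sample))  # -> 166844
```

Keying aggregations on this integer rather than the title slug would have survived both the Stoat/ermine edit war and the Florence and the Machine renames.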

7. wikipedia has a policy that aims toward one outbound link per infobox. So
for a person or organisation page, eg, they tend to settle on that person /
org's homepage and not their social media accounts or web presence(s)
elsewhere. Which makes dbpedia less useful as an identifier triangulation
point

=== end of problems (at least the ones I can remember) ===

So I think we'd be interested in wikidata for 2 (maybe 3) reasons:
1. as a source of data for domains where there's no established (open)
authority (eg the equivalent of musicbrainz for films)
2. as a better, more stable source of identifiers to triangulate to other
data sources
?3?. Possibly as a place to contribute some of our data (eg we're
donating our classical music data to musicbrainz; there may be data we have
that would be useful to wikidata)


Have glanced quickly at the proposed wikidata uri scheme
(http://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme#Proposal_for_Wikidata)
and

http://{site}.wikidata.org/item/{Title} is a semi-persistent convenience URI
for the item about the article Title on the selected site
Semi-persistent refers t