Re: [Wikidata-tech] IRI-value or string-value for URLs?

Markus Krötzsch Thu, 29 Aug 2013 16:17:37 -0700

Dear all,

the main discussion Denny proposes is not the one that we have had on the lists so far. Denny said that the use of the IRI datavalue type would require us to use a specific serialisation format that shows, e.g., the protocol as a separate string. This is a detail of the internal structure of the IRI datatype that we had not talked about yet.

Just to get this out of the way, let me explain briefly why IRIs are considered to consist of multiple strings in some data models (esp. in SMW). The main reason for representing IRIs as several strings (protocol, ...) internally is to aid validation, since these strings allow different characters (also, protocol is case-insensitive while the rest is case-sensitive). This is why the SMW dataitem object for URIs takes multiple strings in its constructor.

However, this does not mean that you have to store the value as a compound object that contains many strings. In fact, this strikes me as a rather cumbersome approach that would make it harder to use the data. In SMW we store URIs as one string. Splitting this string into parts (under the assumption that it was a well-formed URL to start with) is quite easy, if this is needed (SMW does this). Conclusion: the use of a datatype for IRIs is in no way tied to the use of an impractical serialisation; reference implementations exist.
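To illustrate how little work the splitting is, here is a minimal sketch in Python. It is not SMW's code; it assumes a well-formed URL and uses the standard library's `urlsplit`. The function name `split_url` is hypothetical, and the dictionary keys are borrowed from the structure in Denny's message quoted below.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split one stored URL string into separate parts on demand.

    Assumes the input is a well-formed URL; no validation is done here.
    """
    parts = urlsplit(url)
    return {
        # urlsplit lowercases the scheme, which matches the point that
        # the protocol is case-insensitive while the rest is not.
        "protocol": parts.scheme,
        "hierarchicalpart": parts.netloc + parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


if __name__ == "__main__":
    print(split_url("http://www.ietf.org/rfc/rfc1738.txt"))
```

The stored value stays a single string; the parts are derived only when some consumer actually needs them.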

So back to my original concern. The point of my email was to insist that URIs need to be treated differently from strings in many important applications, and that it therefore makes sense to keep the knowledge about this difference in the data model. This only requires us to write "iri" instead of "string" as the datavalue type in the serialisation. That's all I was arguing for. This should also address Denny's one point not related to internal data structures (diffing could use the same code as for strings).
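A rough sketch of what this means for a consumer of the data, in Python. The field names "type" and "value" and both helper functions are assumptions made for illustration, not the actual Wikibase serialisation or API; the escaping of literals is deliberately simplified.

```python
# Two hypothetical serialised data values that differ only in the type tag.
string_value = {"type": "string", "value": "http://www.ietf.org/rfc/rfc1738.txt"}
iri_value = {"type": "iri", "value": "http://www.ietf.org/rfc/rfc1738.txt"}


def diff_text(datavalue: dict) -> str:
    """Text used for diffing: the same code path for strings and IRIs."""
    return datavalue["value"]


def to_rdf_term(datavalue: dict) -> str:
    """Render the value as an N-Triples-style term (simplified escaping).

    This is one place where the type tag matters: an IRI becomes a
    resource reference, a string becomes a plain literal.
    """
    value = datavalue["value"]
    if datavalue["type"] == "iri":
        return "<" + value + ">"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


if __name__ == "__main__":
    print(to_rdf_term(string_value))
    print(to_rdf_term(iri_value))
```

Both values hold a single string, so diffing is unchanged, while an export can still tell a link from a piece of text without looking up anything else.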

The other discussion items that Denny brought up might be interesting at some point, but I would rather focus on the immediate questions for now. Especially if we need to make a decision by Monday, we should narrow the discussion down as much as possible. In particular, introducing the property datatype as additional information into the external JSON format would be a much more complex change, and at the same time would not solve the problem (which was related to processing the JSON dumps).

I agree with Daniel that it would be better if the so-called "internal" format were really internal, but this is not the reality of Wikidata today. Even if we intend to replace the current dumps by new dumps that use "external" formats, we should make sure that our internal format is at least as specific as the basic external formats. In other words: the internal format may contain auxiliary "internal" information and maybe "unofficial" values (like "bad", though this was not intended); but it should also contain all the information that the most basic external formats require. I strongly feel that the internal serialisation is (a representation of) the de facto data model, whatever we may write elsewhere. Code is more powerful than words. Making strings into IRIs there will make strings into IRIs everywhere. I don't think this would be a good design for a data model today.


Cheers,

Markus

P.S. I also do not agree that the "IRI vs. string" question is equally relevant or equally clear as the "commons media vs. string vs. IRI" question. Commons media is an application-level datatype specific to Wikimedia, while IRI and string are fundamental types in formats like XML, RDF and OWL. Most programming languages have special handling for IRIs, comparable to special handling for times, even if neither is a fundamental machine-level type. The question of Commons media is clearly much less important and should not be intertwined here.




On 29/08/13 16:41, Denny Vrandečić wrote:

We are planning to deploy URLs as data values rather soon (i.e.
September 9, if all goes well).

There was a discussion on wikidata-l mailing list:
<http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg02664.html>

The current implementation for URLs uses a string data value. There was
also an IRI data value developed (for this use case), but in a previous
(internal) discussion it was decided to use string value instead.

The above thread included a few strong arguments by Markus for using the
IRI data value. If we want to do this, we need to decide that very
quickly, and change it accordingly.

Let's see if we can make the decision here on this list. We need to make
the decision by Monday at the latest, preferably earlier.

Here are my current thoughts (check also the above-mentioned thread if
you have not already). Currently I have a preference for using the
string value, just to point out my current bias, but I want wider input.

* I do not see the advantage of representing
'http://www.ietf.org/rfc/rfc1738.txt' as a structured data value of the
form { protocol : 'http', hierarchicalpart :
'www.ietf.org/rfc/rfc1738.txt', query : '', fragment : '' }.

* If we use string value, a number of necessary features come for free,
like the diffing, displaying it in the diffs, etc. Sure, there is the
argument that we can use the getString method for these, but then what
is the use case that we actually serve by using the structured data?

* I understood the advantages of being able to *identify* whether the
value of a snak is a string or a URL, but those seem to be the same
advantages as for knowing whether the value of a snak is a Commons media
file name or a string. None of the use cases, though, explains why
using the above data structure is advantageous over a simple string
value.

Please let us collect the arguments for and against using the IRI data
value *structure* here (not for being able to *identify* whether a
string is an IRI or a string).

Not completely independent of that, there are a few questions that need
to be answered but that are not as immediate, i.e. do not have to be
decided by next week:

* should, in the external JSON structure, for every snak the data value
type be listed (as it currently is)? I.e. should it state "string"
instead of "Commons media filename"?

* should, in the external JSON structure, for every snak the data type
of the property used be listed? This would then say URL, and this would
solve all the use cases mentioned by Markus, which rely on *identifying*
this distinction, not on the actual IRI data structure.

* should, in the internal JSON structure, something be changed?

The external JSON structure is the one used when communicating through
the API.
The internal JSON structure is the one that you get when using the dumps.

We need to have an export of the whole Wikidata knowledge base in the
external JSON format, sooner rather than later, and hopefully also in
RDF. The lack of these dumps should not influence our decision right
now, imho :)

Cheers,
Denny

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt
für Körperschaften I Berlin, Steuernummer 27/681/51985.


_______________________________________________
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


