[Wikidata-bugs] [Maniphest] [Commented On] T244341: Wikibase RDF dump: stop using blank nodes for encoding unknown values and OWL constraints

mkroetzsch Fri, 07 Feb 2020 07:26:27 -0800

mkroetzsch added a comment.


  Hi,
  
  Using the same value for "unknown" is a very bad idea and should not be 
considered. You already found out why. This highlights another general design 
principle: the RDF data should encode meaning in structure in a direct way. If 
two triples have the same RDF term as object, then they should represent 
relationships to the same thing, without any further conditions on the shape of 
that term. Otherwise, SPARQL does not work well. For example, the property 
paths you can write with * have no way of performing extra tests on the nodes 
you traverse, so the meaning of a chain must not be influenced by the shape of 
the terms on a property chain, if you want to use * in queries in a meaningful 
way.
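
  To make the point about * concrete, here is a minimal sketch in plain Python 
(not SPARQL). It imitates a P25* path over a toy triple list. All item names, 
the "unknown" term, and the helper functions are made up for illustration; 
"P25" merely stands in for a "mother" property.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, property, object) triples.
SHARED = [
    ("alice", "P25", "unknown"),  # one shared term for every unknown mother
    ("bob", "P25", "unknown"),
]
DISTINCT = [
    ("alice", "P25", "_:b1"),     # a fresh blank node per statement
    ("bob", "P25", "_:b2"),
]


def reachable(triples, start, prop):
    """Nodes reachable from start via zero or more prop edges (like prop*)."""
    edges = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            edges[s].add(o)
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for nxt in edges[node] - seen:
            seen.add(nxt)
            todo.append(nxt)
    return seen


def share_ancestor(triples, a, b, prop="P25"):
    """True if a and b reach a common node other than themselves."""
    common = reachable(triples, a, prop) & reachable(triples, b, prop)
    return bool(common - {a, b})


print(share_ancestor(SHARED, "alice", "bob"))
print(share_ancestor(DISTINCT, "alice", "bob"))
```

  With the shared term, the traversal reports a common ancestor for two 
unrelated people, and the path itself offers no place to filter that node out. 
With one blank node per statement, the two paths never meet.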
  
  This principle is also why we chose bnodes in the first place. OWL also has a 
standard way of encoding the information that some property has an 
(unspecified) value, but the encoding of this looks more like what we have in 
the case of negation (no value) now. If we had used this, one would need 
completely different query patterns to find people with an unspecified date of 
death and people with a specified date of death. In contrast, the current 
bnode encoding allows you to ask a query for everybody with a date of death 
without having to know if it is given explicitly or left unspecified (you don't 
even have to know that the latter is possible). This should be kept in mind: 
the encoding is not just for "use cases" where you are interested in the 
special situation (e.g., someone having unspecified date of death) but also for 
all other queries dealing with data of some kind. For this reason, the RDF 
structure for encoding unspecified values should look as much as possible like 
the cases where there are values.
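
  A rough illustration of the difference, again as a plain Python sketch over 
toy triples. The names and dates are invented, "P570" stands in for "date of 
death", and the OWL-style triples are only a simplified stand-in for a 
restriction-based encoding, not the exact OWL serialization.

```python
# Hypothetical toy data in the bnode style.
BNODE_STYLE = [
    ("ada", "P570", "1900-01-01"),
    ("carl", "P570", "_:b1"),  # died, date unspecified
]

# Assumed stand-in for an OWL-style encoding: the unspecified case is
# expressed through class membership instead of a P570 triple.
OWL_STYLE = [
    ("ada", "P570", "1900-01-01"),
    ("carl", "rdf:type", "_:r1"),
    ("_:r1", "owl:onProperty", "P570"),
    ("_:r1", "owl:someValuesFrom", "rdfs:Literal"),
]


def has_value(triples, prop):
    """One pattern: every subject that has any object for prop."""
    return {s for s, p, _ in triples if p == prop}


def has_value_owl(triples, prop):
    """Two patterns: the direct one plus a lookup through restrictions."""
    direct = has_value(triples, prop)
    restrictions = {
        s for s, p, o in triples if p == "owl:onProperty" and o == prop
    }
    typed = {s for s, p, o in triples if p == "rdf:type" and o in restrictions}
    return direct | typed


print(sorted(has_value(BNODE_STYLE, "P570")))
print(sorted(has_value(OWL_STYLE, "P570")))
print(sorted(has_value_owl(OWL_STYLE, "P570")))
```

  In the bnode style the single pattern finds both people. In the stand-in OWL 
style the same pattern silently misses "carl" unless the query author knows 
about the second shape and asks for it explicitly.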
  
  I am not aware of any other option for encoding "there is a value but we know 
nothing more about it" in RDF or OWL besides the two options I mentioned. The 
proposal to use a made-up IRI instead of a bnode gives identity to the unknown 
(even if that identity has no meaning in our data yet). It works in many 
unspecified-value use cases where bnodes work, but not in all. The three main 
confusions possible are:
  
  1. confusing a placeholder "unspecified" IRI with a real IRI that is expected 
in normal cases (imagine using a FILTER on URL-type property values),
  2. believing that the data changed when only the placeholder IRI has changed 
(imagine someone deleting and re-adding a qualifier with "unspecified" -- if 
it's a bnode, the outcome is the same in terms of RDF semantics, but if you use 
placeholder IRIs, you need to know their special meaning to compare the two RDF 
data sets correctly),
  3. accidental or deliberate uses of placeholder IRIs in other places (imagine 
somebody puts your placeholders as values into a URL-type property).
  
  Case 3 can probably be disallowed by the software (if one thinks of it).
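
  Point 2 can be sketched as follows. This is a deliberately simplified Python 
comparison: it is only valid for the toy case where each blank node occurs once 
and only in object position, whereas comparing RDF graphs in general requires 
an isomorphism check. The example.org IRIs and the Q1 item are hypothetical.

```python
def is_bnode(term):
    """Blank nodes are written with the usual "_:" label prefix here."""
    return term.startswith("_:")


def normalize(triples):
    """Replace every blank-node object by one common marker.

    Assumption: each blank node appears exactly once, as an object.
    """
    return sorted((s, p, "_:" if is_bnode(o) else o) for s, p, o in triples)


# The same statement before and after being deleted and re-added.
before_bnode = [("Q1", "P570", "_:b1")]
after_bnode = [("Q1", "P570", "_:b7")]

before_iri = [("Q1", "P570", "http://example.org/unknown/1")]
after_iri = [("Q1", "P570", "http://example.org/unknown/7")]

print(normalize(before_bnode) == normalize(after_bnode))
print(normalize(before_iri) == normalize(after_iri))
```

  The bnode versions compare as equal because blank node labels carry no 
meaning. The placeholder IRI versions compare as different unless the 
comparison is taught which IRIs are placeholders.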
  
  Another technical issue with the approach is that you would need to use 
placeholder IRIs also with datatype properties that normally require RDF 
literals. RDF engines will tolerate this, and for SPARQL use cases it's not a 
huge difference from tolerating bnodes there. But it does put the data outside 
of OWL, which does not allow one property to take both literals and IRIs as 
values. Unfortunately, there is no equivalent of creating a placeholder IRI for 
things like xsd:int or xsd:string in RDF (in OWL, you can write this with a 
class expression, but it will be structurally different from other cases where 
this data is set).
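
  As a small sketch of what "outside of OWL" means in practice, the following 
Python check flags properties whose values mix literals and IRIs. The tagged 
tuple representation, the "ex:count" property, and the placeholder IRI are all 
assumptions made for this illustration.

```python
# Terms are tagged tuples: ("iri", value) or ("literal", value, datatype).
TRIPLES = [
    ("Q1", "ex:count", ("literal", "1000", "xsd:int")),
    ("Q2", "ex:count", ("iri", "http://example.org/unknown/42")),  # placeholder
    ("Q1", "ex:name", ("literal", "foo", "xsd:string")),
]


def mixed_properties(triples):
    """Properties used with both literal and IRI objects."""
    kinds = {}
    for _, prop, obj in triples:
        kinds.setdefault(prop, set()).add(obj[0])
    return sorted(p for p, k in kinds.items() if k == {"iri", "literal"})


print(mixed_properties(TRIPLES))
```

  Any property reported by such a check is being used both like a datatype 
property and like an object property, which is the situation described above.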
  
  For the encoding of OWL negation, I am not sure if switching this (internal, 
structural) bnode to a (generated, unique) IRI would make any difference. One 
would have to check with the standard to see if this is allowed. I would 
imagine that it just works. In this case, sharing the same auxiliary IRI 
between all negative statements that refer to the same property should also 
work.
  
  So: dropping in placeholder IRIs is the "second best thing" after bnodes, 
but it gives up several advantages and introduces some problems (and of course 
inevitably breaks existing queries). Before doing such a change, there should 
be a clearer argument as to why this would help, and in which cases. The linked 
PDF that is posted here for motivation does not speak about updates, and indeed 
if you look at Aidan's work, he has done a lot of interesting analysis with 
bnodes that would not make any sense without them (e.g., related to comparing 
RDF datasets; related to my point 2 above). I am not a big fan of bnodes 
either, but what we try to encode here is what they have genuinely been 
invented for, and any alternative also has its issues.

TASK DETAIL
  https://phabricator.wikimedia.org/T244341

To: mkroetzsch
Cc: mkroetzsch, Denny, Lucas_Werkmeister_WMDE, Aklapper, dcausse, 
darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Smalyshev, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
