RE: Performance Cost of Reification

Patrick Hoeffel Fri, 09 Oct 2015 09:10:40 -0700

Very well said, Andy. Thank you for taking the time to re-emphasize the 
importance of getting the data model right. I really appreciate it.

Patrick

-----Original Message-----
From: Andy Seaborne [mailto:a...@apache.org] 
Sent: Friday, October 09, 2015 8:21 AM
To: users@jena.apache.org
Subject: Re: Performance Cost of Reification

It is nice that the Titan guys see RDF as something to compare to. 
Coincidentally, I was giving a talk about Property Graph / Linked Data just 
recently at the European ApacheCon BigData conference.

The Property Graph (PG) market is maybe twice the size of the RDF market, 
and both are small.  The challenge is growing the graph market, not one 
form taking market share away from the other.

And the key difference between graph databases (either kind) and other 
data systems is the approach to data modelling.  The differences between 
graph systems are not the key here.

About reification, they are somewhat off-track.  Reification is a quite 
specialised feature for limited use. It is not RDF's equivalent to 
attributes on links in PG.

Let me make that concrete with an example simplified from "Graph 
Databases", chapter 3 (page 52 in my copy).  The book is written by the 
Neo4j folks.

Email provenance.

     A sends_email_to B

Now, you could reify that statement (the act by A of sending the email 
to B).

Reification is way more powerful than just being able to add data to 
the triple.  It says "claim: A sends_email_to B"  - several different and 
competing claims can be made. But let's continue assuming reification 
and assertion of the triple ... [*]

<<A sends_email_to B>>
     cc C
     cc D
     sentOn Tuesday

In the same modelling way you could add attributes to a PG graph edge 
for sends_email_to.

Both PG and RDF modelling here are anti-patterns (as chapter 3 notes for 
PG).
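
To show what the reified form above costs in triples, here is a minimal 
sketch in plain Python, with tuples standing in for RDF triples.  The 
rdf:Statement / rdf:subject / rdf:predicate / rdf:object terms are the 
standard RDF reification vocabulary; the blank node label _:stmt1 and 
the helper name reify are made up for illustration.

```python
def reify(stmt_id, s, p, o):
    """Return the four triples that describe one statement using the
    standard RDF reification vocabulary."""
    return [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", s),
        (stmt_id, "rdf:predicate", p),
        (stmt_id, "rdf:object", o),
    ]

# The asserted triple itself.
triples = [("A", "sends_email_to", "B")]

# The reification of that triple (_:stmt1 is a hypothetical label).
triples += reify("_:stmt1", "A", "sends_email_to", "B")

# Data hung off the reified statement, as in the << >> example.
triples += [
    ("_:stmt1", "cc", "C"),
    ("_:stmt1", "cc", "D"),
    ("_:stmt1", "sentOn", "Tuesday"),
]
```

One asserted triple plus three annotations becomes eight triples when 
stored naively.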

The email sent is an important concept so model it explicitly:

A   sends       MSG
MSG receivedBy  B
MSG cc_to       C
MSG cc_to       D
MSG sentOn      "Tuesday"

By modelling the email message as a first class concept, not implicit in 
the activity via reification/link attributes, you can better add 
information, e.g. which servers it was transferred by and stored on, or 
when it was received (this is email - that might be twice).  You can also 
query it better (who else accessed it on receipt).  Modelling those on 
the act of sending is making life hard (how do you talk about a draft 
email?)

MSG contents        URL_to_content
MSG hasChecksum     0xABCDEF
MSG status          :sent

This is event-based modelling.
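
As a rough sketch of the message-as-a-node model, again using plain 
Python tuples rather than any real RDF API: the helper names objects 
and everyone_who_got, the DRAFT node, the :draft status and the 
URL_to_draft value are assumptions added for illustration.

```python
# The email message is an ordinary node with ordinary triples about it.
triples = {
    ("A", "sends", "MSG"),
    ("MSG", "receivedBy", "B"),
    ("MSG", "cc_to", "C"),
    ("MSG", "cc_to", "D"),
    ("MSG", "sentOn", "Tuesday"),
    ("MSG", "status", ":sent"),
}

def objects(graph, subject, predicate):
    """All objects of (subject, predicate, ?o), sorted for stable output."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)

def everyone_who_got(graph, msg):
    """Direct recipients followed by cc'd recipients of a message."""
    return objects(graph, msg, "receivedBy") + objects(graph, msg, "cc_to")

# A draft is just a message node that nobody has sent yet
# (hypothetical node and values).
draft = {
    ("DRAFT", "status", ":draft"),
    ("DRAFT", "contents", "URL_to_draft"),
}
graph = triples | draft
```

Here everyone_who_got(graph, "MSG") gives B, C and D, and the draft can 
be described without any act of sending to hang the data on.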

If you wanted a highly efficient reification-supporting RDF store, then 
build one.  No need to blindly store as multiple triples (it's called 
compression!).  You don't see such stores because reification is a minor 
feature of RDF.  Event-based modelling and named graphs are often better.
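
One possible reading of "no need to blindly store as multiple triples", 
as a toy sketch only: keep one compact record per reified statement and 
expand it to the four reification triples on demand.  The class name 
CompactReificationStore and its methods are invented here; this is not 
how Jena or any other real store is implemented.

```python
class CompactReificationStore:
    """Toy store: one record per reified statement, expanded on read."""

    def __init__(self):
        # stmt_id -> (subject, predicate, object)
        self._statements = {}

    def add(self, stmt_id, s, p, o):
        """Record a reified statement as a single entry."""
        self._statements[stmt_id] = (s, p, o)

    def triples(self):
        """Yield the standard four reification triples per statement."""
        for stmt_id, (s, p, o) in self._statements.items():
            yield (stmt_id, "rdf:type", "rdf:Statement")
            yield (stmt_id, "rdf:subject", s)
            yield (stmt_id, "rdf:predicate", p)
            yield (stmt_id, "rdf:object", o)

store = CompactReificationStore()
store.add("_:stmt1", "A", "sends_email_to", "B")
expanded = list(store.triples())
```

The store holds one entry, while a client reading triples() still sees 
the four-triple form.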

     Andy

[*]
<< >> is syntax that I proposed in early SPARQL drafts (pre-1.0) for 
reification support, but it didn't gain much support. It is still in the 
ARQ parser source but not active.
