Hello all,
--- questions about extending VOS are contained below; please read through the
leading text briefly, as it provides background for the questions ---
for a research project we currently have to make some decisions
regarding ontology modeling.
We would like to invite you to discuss some general issues and are
interested in your experiences and ideas.
To give you an idea of the kind and size of the ontology to be modeled:
Until now we have been working with the DBpedia ontology, with linked
data added from Freebase, GeoNames, and YAGO concerning resources of
rdf:type Person, Place, Event, Work and Organisation. Thus we are
dealing with about 100 million triples corresponding to approx. 7.6
million resources.
We are trying to devise a system for incrementally adding facts to the
target ontology using the sources mentioned above plus additional linked
data. We are also going to enable end users of our system to add facts.
Moreover, we would like to annotate certain facts (i.e., triples or
groups of triples) with the following pieces of metadata:
- Source (e.g., "DBpedia", "Freebase", "<username>")
- Temporal information, e.g. <Albert_Einstein> <spouse> <Elsa_Einstein>
[start:1919, end: 1936]
Eventually, this might be extended to additionally comprise the following:
- Timestamp (when the triple/s was/were added)
- Confidence value, e.g., in case a fact was extracted from full text
by a text-mining algorithm that provides a confidence measure
Our non-functional requirements mainly focus on high query performance.
This implies that the amount of data should not become too big -
ideally, the whole ontology can be loaded into RAM.
It is furthermore preferable to stay with RDF, but performance
generally has the higher priority for us.
We then would like to query for data within time ranges, e.g.
"All facts valid between X and Y"
Optionally in combination with other metadata ("source=z,
confidence>60") or type filtering (e.g., only relations between Person
and Place) and so on.
We are currently evaluating the following approaches to metadata:
1) Annotating metadata via Named Graphs
1.1) Create a graph FOR EACH TRIPLE containing all individual metadata
example:
Graph <g_Albert_Einstein_spouse_Elsa_Einstein> containing just
triple: <Albert_Einstein> <spouse> <Elsa_Einstein>
<g_Albert_Einstein_spouse_Elsa_Einstein> <source> <dbpedia>.
<g_Albert_Einstein_spouse_Elsa_Einstein> <startdate>
"1919-01-01"^^xsd:date.
<g_Albert_Einstein_spouse_Elsa_Einstein> <enddate>
"1936-12-31"^^xsd:date.
[more metadata possible]
querying for a time interval would be (triples valid within the 20th century):
select ?graph ?s ?p ?o
where {
  ?graph <startdate> ?start .
  ?graph <enddate> ?end .
  filter (?start >= "1900-01-01"^^xsd:date && ?end <= "1999-12-31"^^xsd:date)
  graph ?graph { ?s ?p ?o }
}
pro: Each triple can have arbitrary metadata, so it is possible to
define many optional values (like confidence, which we will have for
only a few triples)
con: Huge number of metadata triples (about 400-500 million metadata
triples for 100 million fact triples)
Is it possible to query this with good performance, assuming that
everything fits into main memory?
1.2) Create a graph FOR SEVERAL TRIPLES sharing same metadata
Triples with the same combination of metadata values share the same RDF
graph.
The paper "Efficient Temporal Querying of RDF Data with SPARQL"
http://www.ifi.uzh.ch/pax/uploads/pdf/publication/1004/tappolet09applied.pdf
explains how to annotate triples with time intervals using named
graphs, so that approach would also fall under 1.2.
One could imagine combining metadata properties, e.g. by creating a
graph containing all triples that come from source dbpedia and have the
exact time interval 1919-01-01 to 1936-12-31.
pro: Fewer metadata triples in comparison to 1.1
con: Clumsy with many types of metadata. Also, when inserting data, we
need an efficient way of detecting whether a graph for a certain
combination of metadata values already exists (hashing, querying, ...)
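The existence check could be sketched as a simple ASK query, assuming the same ad-hoc predicates (<source>, <startdate>, <enddate>) as in the examples above:

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# Does a graph annotated with this source/interval combination already exist?
# (Note: this matches graphs carrying at least these annotations.)
ASK {
  ?g <source>    <dbpedia> ;
     <startdate> "1919-01-01"^^xsd:date ;
     <enddate>   "1936-12-31"^^xsd:date .
}
```

Alternatively, deriving the graph IRI deterministically, e.g. from a hash of the sorted metadata key/value pairs, would avoid the lookup query entirely.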
2) Annotating metadata via an N-TUPLE STORE
On the mailing list there was once a rumor that it would be (easily)
possible to extend Virtuoso's quad store with additional columns.
Under these circumstances one could create a new column for each
metadata property, at least for frequently used properties. Thinking
about this approach, further questions arise:
- Is it in general a good idea to do this (would you recommend it)?
- Has anybody done this before?
- Is it possible to extend the SPARQL syntax in order to be able to
continue using SPARQL?
- What about performance issues?
pro: We assume that performance will be good (with additional indices).
The amount of data will hopefully be acceptable (on the one hand there
is no aggregation: each triple carries its own metadata; on the other
hand, metadata values are not stored as "expensive" types or even as triples)
con: We would definitely leave the standards, so the data would no
longer be interchangeable independently of the store (which would be
acceptable for us, because we want to provide our data along with our
own Web service). And we would need adaptations to Virtuoso - it is not
clear to what extent.
3) Annotating metadata WITHIN THE ONTOLOGY
3.1) Classical N-Ary Approach: Inserting arbitrary entities
Ref.: http://www.w3.org/TR/swbp-n-aryRelations/
In practice, n-ary relations can be modeled in different ways.
Regarding our example: since the <spouse> property is symmetric
(a prop b ==> b prop a), we could write the following:
<Spouse_123> <member> <Albert_Einstein> .
<Spouse_123> <member> <Elsa_Einstein> .
<Spouse_123> <source> <dbpedia> .
<Spouse_123> <startdate> "1919-01-01"^^xsd:date .
<Spouse_123> <enddate> "1936-12-31"^^xsd:date .
In this case <Albert_Einstein> and <Elsa_Einstein> are equal members of
the relation. Instead of <Spouse_123> we could also use a blank node.
pro: It seems to be a state-of-the-art approach. This way one can model
any metadata and even express symmetric and inverse facts quite easily.
In comparison with reification, fewer triples are needed.
con: The original triple isn't kept, so the structure changes for
triples that are annotated - one has to account for this when querying.
General queries like "How many persons are directly connected to other
persons?" can no longer be answered easily. One could of course keep
the original triple, at the price of increasing the overall number of
triples.
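For illustration, a time-range query against this n-ary modeling might look as follows (a sketch using the assumed predicates from the example; the inequality filter avoids trivial self-pairs):

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# All spouse pairs whose relation was valid within the 20th century.
SELECT ?rel ?a ?b
WHERE {
  ?rel <member>    ?a .
  ?rel <member>    ?b .
  ?rel <startdate> ?start .
  ?rel <enddate>   ?end .
  FILTER (?a != ?b
          && ?start >= "1900-01-01"^^xsd:date
          && ?end   <= "1999-12-31"^^xsd:date)
}
```

Note that each pair is returned twice (once per member ordering), which a symmetric modeling has to accept or filter out.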
3.2) Our own approach: Inserting sub properties
Let's just show an example:
<Albert_Einstein> <spouse_123> <Elsa_Einstein> .
<spouse_123> rdfs:subPropertyOf <spouse> .
<spouse_123> <source> <dbpedia> .
<spouse_123> <startdate> "1919-01-01"^^xsd:date .
<spouse_123> <enddate> "1936-12-31"^^xsd:date .
So for each relationship between two entities we need a new property,
which we connect to the original property via rdfs:subPropertyOf.
pro: Relationships between resources (like <Albert_Einstein> and
<Elsa_Einstein>) stay directly connected, so they are easy to query.
Only if we are also interested in a specific property (or some of its
metadata values) do we additionally have to query for the corresponding
subproperty relation. Under some circumstances we need fewer triples
than with the classical n-ary approach.
con: For every A-Box relationship we define a new property, which is
actually part of the T-Box. In this way the approach violates the
separation of A-Box and T-Box (a conceptual problem, not a technical
one).
3.3) Reification / Annotation Properties
Ref.: http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification
Ref.: http://www.w3.org/TR/owl-ref/#Annotations
Ref.: http://www.w3.org/TR/owl2-primer/#Annotating_Axioms_and_Entities
With reification our example would look like this:
<Albert_Einstein> <spouse> <Elsa_Einstein> .
<statement_123> rdf:type rdf:Statement .
<statement_123> rdf:subject <Albert_Einstein> .
<statement_123> rdf:predicate <spouse> .
<statement_123> rdf:object <Elsa_Einstein> .
<statement_123> <source> <dbpedia> .
<statement_123> <startdate> "1919-01-01"^^xsd:date .
<statement_123> <enddate> "1936-12-31"^^xsd:date .
This approach does not seem to be a good option, because 4 triples are
needed just to define a new statement entity. If every triple were to
be annotated, we would end up with about 400 million additional triples
before even counting the actual metadata!
Annotation properties seem to be a very similar approach using the OWL
namespace.
pro: We keep the original triple.
con: It blows up the amount of data unacceptably and, along with this,
requires more complex queries than the n-ary approaches.
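To make the query overhead concrete, here is a sketch of the time-range query under reification (vocabulary as in the example above):

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# Reconstruct annotated statements together with their validity interval.
SELECT ?s ?p ?o ?start ?end
WHERE {
  ?st rdf:type      rdf:Statement ;
      rdf:subject   ?s ;
      rdf:predicate ?p ;
      rdf:object    ?o ;
      <startdate>   ?start ;
      <enddate>     ?end .
  FILTER (?start >= "1900-01-01"^^xsd:date && ?end <= "1999-12-31"^^xsd:date)
}
```

Every annotated fact costs a five-way join on the statement node before the actual filtering even starts.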
We think it would be a good idea to combine some of the approaches,
e.g. by using named graphs for annotating the source of facts and
n-ary approaches for the rest of the metadata.
To sum up our questions we would like to know:
- What about the n-tuple approach? Can Virtuoso be extended to handle
n-tuples within graphs, rather than triples? What adaptations would
have to be made, and what query performance could be expected compared
to SPARQL on triples/quads?
- About named graphs: When and how would you use this approach or how
did you use it?
- Do you have experience with n-ary approaches - are there
problems/disadvantages (e.g. performance issues)?
- Do you have any other ideas of storing metadata for RDF triples?
Thanks in advance,
--
--------------------------------
Martin Gerlach
Softwareentwicklung
neofonie
Technologieentwicklung und
Informationsmanagement GmbH
Robert-Koch-Platz 4
10115 Berlin
fon: +49.30 24627 413
fax: +49.30 24627 120
[email protected]
http://www.neofonie.de
Handelsregister
Berlin-Charlottenburg: HRB 67460
Geschaeftsfuehrung
Helmut Hoffer von Ankershoffen
(Sprecher der Geschaeftsfuehrung)
Nurhan Yildirim