Hello all,
--- questions about extending VOS are contained below; please read through the
leading text briefly, as it provides background for the questions ---
for a research project we currently have to make some decisions
regarding ontology modeling.
We would like to invite you to discuss some general issues and are
interested in your experiences and ideas.
To give you an idea of the kind and size of the ontology to be modeled:
Until now we have been working with the DBpedia ontology, with linked
data added from Freebase, GeoNames, and YAGO concerning resources of
rdf:type Person, Place, Event, Work and Organisation. Thus we are
dealing with about 100 million triples corresponding to approx. 7.6
million resources.
We are trying to devise a system for incrementally adding facts to the
target ontology using the sources mentioned above plus additional linked
data. We are also going to enable end users of our system to add facts.
Moreover, we would like to annotate certain facts (i.e., triples or
groups of triples) with the following pieces of metadata:
- Source (e.g., "DBpedia", "Freebase", "<username>")
- Temporal information, e.g. <Albert_Einstein> <spouse> <Elsa_Einstein>
[start:1919, end: 1936]
Eventually, this might be extended to additionally comprise the following:
- Timestamp (when the triple/s was/were added)
- Confidence value, e.g., in case a fact was extracted from full text
by a text-mining algorithm that provides a confidence measure
Our non-functional requirements mainly focus on high query performance.
This implies that the amount of data should not become too big -
ideally, the whole ontology can be loaded into RAM.
It is furthermore preferable to stay with RDF, but performance
generally has the higher priority for us.
We then would like to query for data within time ranges, e.g.
"All facts valid between X and Y"
Optionally in combination with other metadata ("source=z,
confidence>60") or type filtering (e.g., only relations between Person
and Place) and so on.
We are currently evaluating the following approaches to metadata:
1) Annotating metadata via Named Graphs
1.1) Create a graph FOR EACH TRIPLE containing all individual metadata
example:
Graph <g_Albert_Einstein_spouse_Elsa_Einstein> containing just
triple: <Albert_Einstein> <spouse> <Elsa_Einstein>
<g_Albert_Einstein_spouse_Elsa_Einstein> <source> <dbpedia>.
<g_Albert_Einstein_spouse_Elsa_Einstein> <startdate>
"1919-01-01"^^xsd:date.
<g_Albert_Einstein_spouse_Elsa_Einstein> <enddate>
"1936-12-31"^^xsd:date.
[more metadata possible]
querying for a time interval would be (triples valid within the 20th century):
select ?graph ?s ?p ?o
where {
  ?graph <startdate> ?start .
  ?graph <enddate> ?end .
  filter (?start >= "1900-01-01"^^xsd:date && ?end <= "1999-12-31"^^xsd:date)
  graph ?graph { ?s ?p ?o }
}
pro: Each triple can have arbitrary metadata, so it is possible to
define many optional values (like confidence, which we will have for
only a few triples)
con: Huge number of metadata triples (about 400-500 million metadata
triples for 100 million fact triples)
Is it possible to query this with good performance, assuming that
everything fits into main memory?
1.2) Create a graph FOR SEVERAL TRIPLES sharing same metadata
Triples with the same combination of metadata values share the same RDF
graph.
The paper "Efficient Temporal Querying of RDF Data with SPARQL"
http://www.ifi.uzh.ch/pax/uploads/pdf/publication/1004/tappolet09applied.pdf
explains how to annotate triples with time intervals using named
graphs, so that approach would also fall under 1.2.
One could imagine combining metadata properties, e.g. by creating a
graph containing all triples that come from source dbpedia and have the
exact time interval 1919-01-01 to 1936-12-31.
pro: Fewer metadata triples in comparison to 1.1
con: Clumsy with many types of metadata. Also, when inserting data, we
need an efficient way of detecting whether a graph for a certain
combination of metadata values already exists (hashing, querying, ...)
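The existence check could be sketched as a simple ASK query, assuming the same ad-hoc predicates (<source>, <startdate>, <enddate>) as in the examples above:

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# Does a graph annotated with this source/interval combination already exist?
# (Note: this matches graphs carrying at least these annotations.)
ASK {
  ?g <source>    <dbpedia> ;
     <startdate> "1919-01-01"^^xsd:date ;
     <enddate>   "1936-12-31"^^xsd:date .
}
```

Alternatively, deriving the graph IRI deterministically, e.g. from a hash of the sorted metadata key/value pairs, would avoid the lookup query entirely.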
2) Annotating metadata via an N-TUPLE STORE
On the mailing list there was once a rumor that it would be (easily)
possible to extend Virtuoso's quad store with additional columns.
Under these circumstances one could create a new column for each
metadata property, at least for frequently used properties. Thinking
about this approach, further questions arise:
- Is it in general a good idea to do this (would you recommend it)?
- Has anybody done this before?
- Is it possible to extend the SPARQL syntax in order to be able to
continue using SPARQL?
- What about performance issues?
pro: We assume that performance will be good (with additional indices).
The amount of data will hopefully be acceptable (on the one hand there
is no aggregation: each triple carries its own metadata; on the other
hand, metadata values are not stored as "expensive" types or even as triples)
con: We would definitely leave the standards, so the data would no
longer be interchangeable independently of the store (which would be
acceptable for us, because we want to provide our data along with our
own Web service). And we would need adaptations to Virtuoso - it is not
clear to what extent.
3) Annotating metadata WITHIN THE ONTOLOGY
3.1) Classical N-Ary Approach: Inserting arbitrary entities
Ref.: http://www.w3.org/TR/swbp-n-aryRelations/
In practice, n-ary relations can be modeled in different ways.
Regarding our example: since the <spouse> property is symmetric
(a prop b ==> b prop a), we could write the following:
<Spouse_123> <member> <Albert_Einstein> .
<Spouse_123> <member> <Elsa_Einstein> .
<Spouse_123> <source> <dbpedia> .
<Spouse_123> <startdate> "1919-01-01"^^xsd:date .
<Spouse_123> <enddate> "1936-12-31"^^xsd:date .
In this case <Albert_Einstein> and <Elsa_Einstein> are equal members of
the relation. Instead of <Spouse_123> we could also use a blank node.
pro: It seems to be a state-of-the-art approach. This way one can model
any metadata and even express symmetric and inverse facts quite easily.
In comparison with reification, fewer triples are needed.
con: The original triple isn't kept, so the structure changes for
triples that are annotated - one has to account for this when querying.
General queries like "How many persons are directly connected to other
persons?" can no longer be answered easily. One could of course keep
the original triple, at the price of increasing the overall number of
triples.
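For illustration, a time-range query against this n-ary modeling might look as follows (a sketch using the assumed predicates from the example; the inequality filter avoids trivial self-pairs):

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# All spouse pairs whose relation was valid within the 20th century.
SELECT ?rel ?a ?b
WHERE {
  ?rel <member>    ?a .
  ?rel <member>    ?b .
  ?rel <startdate> ?start .
  ?rel <enddate>   ?end .
  FILTER (?a != ?b
          && ?start >= "1900-01-01"^^xsd:date
          && ?end   <= "1999-12-31"^^xsd:date)
}
```

Note that each pair is returned twice (once per member ordering), which a symmetric modeling has to accept or filter out.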
3.2) Our own approach: Inserting sub properties
Let's just show an example:
<Albert_Einstein> <spouse_123> <Elsa_Einstein> .
<spouse_123> rdfs:subPropertyOf <spouse> .
<spouse_123> <source> <dbpedia> .
<spouse_123> <startdate> "1919-01-01"^^xsd:date .
<spouse_123> <enddate> "1936-12-31"^^xsd:date .
So for each relationship between two entities we need a new property,
which we connect to the original property via rdfs:subPropertyOf.
pro: Relationships between resources (like <Albert_Einstein> and
<Elsa_Einstein>) stay directly connected, so they are easy to query.
Only if we are also interested in a specific property (or some of its
metadata values) do we additionally have to query for the corresponding
subproperty relation. Under some circumstances we need fewer triples
than with the classical n-ary approach.
con: For every A-Box relationship we define a new property, which is
actually part of the T-Box. In this way the approach violates the
separation of A-Box and T-Box (a conceptual problem, not a technical
one).
3.3) Reification / Annotation Properties
Ref.: http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification
Ref.: http://www.w3.org/TR/owl-ref/#Annotations
Ref.: http://www.w3.org/TR/owl2-primer/#Annotating_Axioms_and_Entities
With reification our example would look like this:
<Albert_Einstein> <spouse> <Elsa_Einstein> .
<statement_123> rdf:type rdf:Statement .
<statement_123> rdf:subject <Albert_Einstein> .
<statement_123> rdf:predicate <spouse> .
<statement_123> rdf:object <Elsa_Einstein> .
<statement_123> <source> <dbpedia> .
<statement_123> <startdate> "1919-01-01"^^xsd:date .
<statement_123> <enddate> "1936-12-31"^^xsd:date .
This approach does not seem to be a good option, because 4 triples are
needed just to define a new statement entity. If every triple were to
be annotated, we would end up with about 400 million additional triples
before even counting the actual metadata!
Annotation properties seem to be a very similar approach using the OWL
namespace.
pro: We keep the original triple.
con: It blows up the amount of data unacceptably and, along with this,
requires more complex queries than the n-ary approaches.
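To make the query overhead concrete, here is a sketch of the time-range query under reification (vocabulary as in the example above):

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# Reconstruct annotated statements together with their validity interval.
SELECT ?s ?p ?o ?start ?end
WHERE {
  ?st rdf:type      rdf:Statement ;
      rdf:subject   ?s ;
      rdf:predicate ?p ;
      rdf:object    ?o ;
      <startdate>   ?start ;
      <enddate>     ?end .
  FILTER (?start >= "1900-01-01"^^xsd:date && ?end <= "1999-12-31"^^xsd:date)
}
```

Every annotated fact costs a five-way join on the statement node before the actual filtering even starts.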
We think it would be a good idea to combine some of the approaches,
e.g. by using named graphs for annotating the source of facts and
n-ary approaches for the rest of the metadata.
To sum up our questions we would like to know:
- What about the n-tuple approach? Can Virtuoso be extended to handle
n-tuples within graphs, rather than triples? What adaptations would
have to be made, and what query performance could be expected compared
to SPARQL on triples/quads?
- About named graphs: When and how would you use this approach or how
did you use it?
- Do you have experience with n-ary approaches - are there
problems/disadvantages (e.g. performance issues)?
- Do you have any other ideas of storing metadata for RDF triples?
Thanks in advance,
--
--------------------------------
Martin Gerlach
Softwareentwicklung
neofonie
Technologieentwicklung und
Informationsmanagement GmbH
Robert-Koch-Platz 4
10115 Berlin
fon: +49.30 24627 413
fax: +49.30 24627 120
[email protected]
http://www.neofonie.de
Handelsregister
Berlin-Charlottenburg: HRB 67460
Geschaeftsfuehrung
Helmut Hoffer von Ankershoffen
(Sprecher der Geschaeftsfuehrung)
Nurhan Yildirim