Hello Martin,

The cost of timestamping depends on which combinations of time intervals
in the stored data and "windows" (i.e. the time intervals specified in the
query) occur. Listed in ascending order of calculation cost, the cases are:
(1) search for all data whose time point lies in a relatively narrow window;
(2) search for all data whose whole time interval lies entirely in a
relatively narrow window;
(3) search for all data whose time interval intersects a window;
(4) search for all data such that the window lies inside the time interval
associated with the datum;
(5) search for data with some generic predicate on time interval and window.
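
In terms of the named-graph layout from your variant 1.1 below, the window
predicates look roughly as follows. This is only a sketch: the property
names <startdate>/<enddate> and the window bounds X = 1900-01-01,
Y = 1999-12-31 are taken from your own example.

    # case (4): all facts whose validity interval contains the whole window
    select ?g ?s ?p ?o
    where {
        ?g <startdate> ?start .
        ?g <enddate> ?end .
        graph ?g { ?s ?p ?o }
        filter (?start <= "1900-01-01"^^xsd:date && ?end >= "1999-12-31"^^xsd:date)
        # case (2) would instead be: ?start >= X && ?end <= Y
        # case (3) would instead be: ?start <= Y && ?end >= X
    }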

Case (5) is hopeless for optimization: it can't be the leading part of a
query plan, so it can only filter out facts selected by other conditions.

Case (4) may require a special, tricky index for good scalability, because
if it is implemented as an AND of two '<=' operators then the selectivity of
each operator is low.
In addition, simple versions of interval indexes optimized for case (4) work
fine only if the time intervals in the data are of similar, or at least
bounded, size.
A good case is the employment periods of lecturers at Leipzig Univ: the
periods range from fractions of a year to tens of years, and they are spread
over a 600-year time-line.
A bad case is DBpedia, with geology and fashion on the same time-line.

Unfortunately "All facts continuously valid between X and Y" is case (4).

Case (3) is similar to (4).

Cases (1) and (2) are not a catastrophe.
The only thing to remember is that the query optimizer has no idea of the
relation between the start of an interval in your data and its end.
So instead of ((start >= X) and (end <= Y)) you should write ((start between
X and Y) and (end between X and Y)).
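
As a sketch only (variable and property names again follow your variant 1.1
below; in SPARQL the "between" is simply spelled as a pair of comparisons):

    select ?g ?s ?p ?o
    where {
        ?g <startdate> ?start .
        ?g <enddate> ?end .
        graph ?g { ?s ?p ?o }
        # both ?start and ?end are bounded on both sides, so the optimizer
        # can use a range scan on either value instead of a one-sided test
        filter (?start >= "1900-01-01"^^xsd:date && ?start <= "1999-12-31"^^xsd:date &&
                ?end >= "1900-01-01"^^xsd:date && ?end <= "1999-12-31"^^xsd:date)
    }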

An n-tuple store may really help with cases (3) and (4), but it is a waste
of space for facts that are not annotated.
If everything, or almost everything, is annotated, the implementation is
accurate, and the queries make intensive use of the annotation data, then it
can be the fastest variant.
The disadvantage is that the method will require some unusual procedures for
loading and unloading data.
Yes, it is quite possible to extend Virtuoso's quad store with more columns,
and the required amount of code is reasonably small, but one should know
Virtuoso in depth to get reasonable scalability.
I don't say that a consulting contract is absolutely needed, but... well,
Kingsley knows affordable variants, esp. for academic purposes and software
companies, so contact him :)

Variant 3.2 (subproperties) will be inefficient with the 3+2 indexing
scheme that is recommended for big Virtuoso 6 storages, and moreover it will
kill Virtuoso's subclass/subproperty inference, because Virtuoso caches that
data in memory.
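
To make the query-side cost concrete, here is a sketch (prefixes omitted and
property names taken from your 3.2 example) of what a lookup of annotated
spouse facts would look like under that variant:

    select ?a ?b ?start ?end
    where {
        # the predicate cannot be named as a constant any more;
        # it has to be found through rdfs:subPropertyOf
        ?a ?p ?b .
        ?p rdfs:subPropertyOf <spouse> .
        ?p <startdate> ?start .
        ?p <enddate> ?end .
    }

The predicate position stays a variable and there is one throw-away property
like <spouse_123> per annotated fact, which both defeats index layouts that
expect a known predicate and bloats the cached subproperty data mentioned
above.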

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com


On Mon, 2010-03-29 at 16:34 +0200, Martin Gerlach wrote:
> Hello all,
> 
> --- questions about extending VOS contained below, pls read through the
> leading text briefly, it provides background for the questions ---
> 
> for a research project we currently have to make some decisions
> regarding ontology modeling.
> 
> We would like to invite you to discuss some general issues and are
> interested in your experiences and ideas.
> 
> To give you an idea of the kind and size of the ontology to be modeled:
> Until now we have been working with the DBpedia ontology with linked
> data added from Freebase, Geonames, and Yago concerning resources of
> rdf:type Person, Place, Event, Work and Organisation. Thus we are
> dealing with about 100 million triples corresponding to approx. 7.6
> million resources.
> 
> We are trying to devise a system for incrementally adding facts to the
> target ontology using the sources mentioned above plus additional linked
> data. We are also going to enable end users of our system to add facts.
> 
> Moreover we would like to annotate certain facts (i.e., triples, or
> groups of triples) with the following pieces of metadata:
> - Source (e.g., "DBpedia", "Freebase", "<username>")
> - Temporal information, e.g. <Albert_Einstein> <spouse> <Elsa_Einstein>
> [start:1919, end: 1936]
> 
> Eventually, this might be extended to additionally comprise the following:
> - Timestamp (when the triple/s was/were added)
> - Confidence value, e.g., in the case a fact was extracted from fulltext
> by a text mining algorithm that provides a measurement of confidence
> 
> Our non-functional requirements mainly focus on querying with high
> performance. This includes that the amount of data should not become too
> big - ideally, the whole ontology can be loaded into RAM.
> 
> It is furthermore preferable to stay with RDF, but performance generally
> has the higher priority for us.
> 
> We then would like to query for data within time ranges, e.g.
> 
> "All facts valid between X and Y"
> 
> Optionally in combination with other metadata ("source=z,
> confidence>60") or type filtering (e.g., only relations between Person
> and Place) and so on.
> 
> We are currently evaluating the following approaches to metadata:
> 
> 1) Annotating metadata via Named Graphs
> 
> 1.1) Create a graph FOR EACH TRIPLE containing all individual metadata
> 
> example:
>     Graph <g_Albert_Einstein_spouse_Elsa_Einstein> containing just
> triple: <Albert_Einstein> <spouse> <Elsa_Einstein>
>     <g_Albert_Einstein_spouse_Elsa_Einstein> <source> <dbpedia>.
>     <g_Albert_Einstein_spouse_Elsa_Einstein> <startdate>
> "1919-01-01"^^xsd:date.
>     <g_Albert_Einstein_spouse_Elsa_Einstein> <enddate>
> "1936-12-31"^^xsd:date.
>     [more metadata possible]
> 
> querying for a time interval would be (triples within the 20th century):
>     select ?graph ?s ?p ?o
>     where {
>         ?graph <startdate> ?start.
>         ?graph <enddate> ?end.
>         filter (?start > "1900-01-01"^^xsd:date && ?start < "2000-01-01"^^xsd:date
>              && ?end > "1900-01-01"^^xsd:date && ?end < "2000-01-01"^^xsd:date)
>         graph ?graph { ?s ?p ?o }
>     }
> 
> 
> pro: Each triple can have any metadata, therefore it's possible to
> define many optional values (like confidence, which we will have for only
> a few triples)
> con: A huge number of metadata triples (about 400 - 500 mio. metadata
> triples for 100 mio. fact triples)
> Is it possible to query this with good performance, assuming that
> everything fits into main memory?
> 
> 1.2) Create a graph FOR SEVERAL TRIPLES sharing same metadata
> 
> Triples with the same combination of metadata values share the same RDF
> graph.
> 
> The paper "Efficient Temporal Querying of RDF Data with SPARQL"
> http://www.ifi.uzh.ch/pax/uploads/pdf/publication/1004/tappolet09applied.pdf
> explains how to annotate triples with time intervals using named graphs,
> so that approach would also fall under 1.2.
> 
> One could imagine combining metadata properties, e.g. by creating a graph
> containing all triples that come from the source dbpedia and have the
> exact time interval 1919-01-01 to 1936-01-01.
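> 
> For illustration, such a shared graph could look like this (the graph name
> is just a placeholder):
> 
>     Graph <g_dbpedia_1919_1936> containing all triples with this
>     combination of metadata, e.g.:
>         <Albert_Einstein> <spouse> <Elsa_Einstein> .
>     plus the shared metadata triples:
>         <g_dbpedia_1919_1936> <source> <dbpedia> .
>         <g_dbpedia_1919_1936> <startdate> "1919-01-01"^^xsd:date .
>         <g_dbpedia_1919_1936> <enddate> "1936-01-01"^^xsd:date .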
> 
> pro: Fewer metadata triples in comparison to 1.1
> con: Clumsy with many types of metadata. Also, when inserting data, we
> need an efficient way of detecting whether a graph for a certain
> combination of metadata already exists or not (hashing, querying, ...)
> 
> 
> 2) Annotating metadata via an N-TUPLE STORE
> 
> On the mailing list there was once a rumor that it would be (easily)
> possible to extend Virtuoso's quad store with more columns.
> 
> Under these circumstances one could create a new column for each
> metadata property, at least for frequently used properties. Thinking
> about this approach, further questions arise:
> - Is it in general a good idea to do this (would you recommend it)?
> - Has anybody done this before?
> - Is it possible to extend the SPARQL syntax in order to be able to
> continue using SPARQL?
> - What about performance issues?
> 
> pro: We assume that performance will be good (with additional indices).
> The amount of data will hopefully be acceptable (on the one hand there
> is no aggregation: each triple contains its own metadata, on the other
> hand metadata values are not saved as "expensive" types or even triples)
> con: We would definitely leave the standards. So the data isn't
> interchangeable independently of the store anymore (which would be
> acceptable for us because we want to provide our data along with our own
> Web service). And we would need adaptations to Virtuoso - and it is not
> clear to what extent.
> 
> 
> 3) Annotating metadata WITHIN THE ONTOLOGY
> 
> 3.1) Classical N-Ary Approach: Inserting arbitrary entities
> 
> Ref.: http://www.w3.org/TR/swbp-n-aryRelations/
> 
> In practice n-ary relations can be modeled in different ways:
> 
> Regarding our example: because the <spouse> property is symmetric
> (a prop b ==> b prop a), we could write the following:
> 
>     <Spouse_123> <member> <Albert_Einstein> .
>     <Spouse_123> <member> <Elsa_Einstein> .
>     <Spouse_123> <source> <dbpedia> .
>     <Spouse_123> <startdate> "1919-01-01"^^xsd:date .
>     <Spouse_123> <enddate> "1936-12-31"^^xsd:date .
> 
> In this case <Albert_Einstein> and <Elsa_Einstein> are equal members of
> the relation. Instead of <Spouse_123> we could also use a blank node.
> 
> pro: It seems to be a state-of-the-art approach. This way one can model
> any metadata and even express symmetric and inverse facts quite easily. In
> comparison with reification, fewer triples are needed.
> con: The original triple isn't kept. So the structure changes for
> annotated triples - one has to account for this while querying.
> General questions like "How many persons are directly connected to other
> persons?" can no longer be answered easily. One could of course keep
> the original triple at the price of increasing the overall number of
> triples.
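> 
> For example, the counting query just mentioned would have to go through
> the relation entity for facts modeled this way, roughly like this (a
> sketch; <member> is as above and <Person> is a placeholder class name):
> 
>     select (count(*) as ?pairs)
>     where {
>         ?rel <member> ?a .
>         ?rel <member> ?b .
>         ?a rdf:type <Person> .
>         ?b rdf:type <Person> .
>         filter (?a != ?b)
>     }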
> 
> 3.2) Our own approach: Inserting sub properties
> 
> Let's just show an example:
> 
>     <Albert_Einstein> <spouse_123> <Elsa_Einstein> .
>     <spouse_123> rdfs:subPropertyOf <spouse> .
>     <spouse_123> <source> <dbpedia> .
>     <spouse_123> <startdate> "1919-01-01"^^xsd:date .
>     <spouse_123> <enddate> "1936-12-31"^^xsd:date .
> 
> So for each relationship between two entities we need a new property
> that we connect to the original property via rdfs:subPropertyOf.
> 
> pro: Relationships between resources (like <Albert_Einstein> and
> <Elsa_Einstein>) stay directly connected, so it is easy to query them.
> Only if we are also interested in a specific property (or some of the
> metadata values) do we additionally have to query for the corresponding
> subproperty relation. Under some circumstances we need fewer triples than
> with the classical n-ary approach.
> con: For every A-Box relationship we define a new property, which is
> actually part of the T-Box. In this way the approach violates the
> separation of A-Box and T-Box (which is a conceptual problem, not a
> technical one).
> 
> 3.3) Reification / Annotation Properties
> 
> Ref.: http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification
> Ref.: http://www.w3.org/TR/owl-ref/#Annotations
> Ref.: http://www.w3.org/TR/owl2-primer/#Annotating_Axioms_and_Entities
> 
> With reification our example would look like this:
>     <Albert_Einstein> <spouse> <Elsa_Einstein> .
>     <statement_123> rdf:type rdf:Statement .
>     <statement_123> rdf:subject <Albert_Einstein> .
>     <statement_123> rdf:predicate <spouse> .
>     <statement_123> rdf:object <Elsa_Einstein> .
>     <statement_123> <source> <dbpedia> .
>     <statement_123> <startdate> "1919-01-01"^^xsd:date .
>     <statement_123> <enddate> "1936-12-31"^^xsd:date .
> 
> This approach does not seem to be a good option, because 4 triples are
> needed just to define a new statement entity. If every triple were to be
> annotated, we would end up with about 400 mio. extra triples before even
> counting the actual metadata!
> 
> Annotation properties seem to be a very similar approach using the OWL
> namespace.
> 
> pro: We keep the original triple.
> con: It blows up the amount of data unacceptably and, along with this,
> requires more complex queries than the n-ary approaches.
> 
> We think that it would be a good idea to combine some of the approaches,
> e.g. by using Named Graphs for annotating the source of facts and using
> n-ary approaches for the rest of the metadata.
> 
> To sum up our questions, we would like to know:
> - What about the n-tuple approach? Can Virtuoso be extended to handle
> n-tuples within graphs, rather than triples? Which adaptations would have
> to be made and what query performance could be expected compared to
> SPARQL on triples/quads?
> - About named graphs: When and how would you use this approach, or how
> have you used it?
> - Do you have experience with n-ary approaches - are there
> problems/disadvantages with them (e.g. performance issues)?
> - Do you have any other ideas for storing metadata for RDF triples?
> 
> Thanks in advance,


