ok, this will work.
If all information about a resource is held in one named graph, then a
straightforward delete/insert process will remove the following sets
from DBpedia: the owl:sameAs links, images, YAGO, UMBEL, OpenCyc,
everything in the DBpedia ontology namespace (as these will be replaced
by Wikipedia template annotations), and any other information not
produced by the extractors.
One solution would be either to move this data to a different named
graph or to do a diff.
The diff would be:
DELETE FROM :London
{ ?s ?p ?o }
WHERE {
  GRAPH :London {
    ?s ?p ?o .
    # keep triples that carry an axiom annotation
    OPTIONAL {
      ?a rdf:type owl:Axiom .
      ?a owl:subject ?s .
      ?a owl:predicate ?p .
      ?a owl:object ?o .
      ?a ?p2 ?o2 .
      # FILTER (...) to select all that should stay
    }
    # keep the annotation nodes themselves
    OPTIONAL { ?s rdf:type owl:Axiom . ?s owl:subject ?annotated . }
    FILTER (!bound(?a) && !bound(?annotated)) .
  }
}
The basic idea should be clear: delete everything that doesn't match
the pattern.
Actually, moving anything that should stay to a different graph would be
much easier now.
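For example, a rough SPARUL sketch of the "move" variant (assuming a
hypothetical :London_static graph for the data that should survive
re-extraction; the filter is left as a placeholder):

INSERT INTO :London_static
{ ?s ?p ?o }
WHERE {
  GRAPH :London {
    ?s ?p ?o .
    # FILTER (...) to select all triples that should stay
  }
}

DELETE FROM :London
{ ?s ?p ?o }
WHERE {
  GRAPH :London_static { ?s ?p ?o }
}

(executed as two separate requests). Afterwards the live extraction
could delete and re-insert :London without touching the static data.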
@Kingsley: What do you think? I can prepare initial metadata for the
loaded datasets tomorrow. What would be your decision? Shall we use the
2.4 billion triples, or should we have different named graphs for each
extractor/template rather than for each instance? The latter does not
produce any overhead and does not need a diff operation, as the
provenance of triples is clear. The owl:Axiom annotations are far more
powerful, but come at a price.
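To illustrate the per-extractor variant: if, say, a hypothetical graph
IRI like graphs:InfoboxExtractor held exactly the triples produced by
that extractor, provenance would simply be the graph name, e.g.

# which extractors produced statements about London?
SELECT DISTINCT ?g
WHERE {
  GRAPH ?g { <http://dbpedia.org/resource/London> ?p ?o }
}

and refreshing one extractor's output would just be a per-graph
delete/insert, without any diff.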
Also, what is happening to the data that will not be refreshed, like
YAGO and the owl:sameAs links? Are we moving it to separate graphs, or
shall I implement the new diff?
Regards, Sebastian
Kingsley Idehen wrote:
Jens Lehmann wrote:
Hello,
Sebastian Hellmann wrote:
Hello,
[...]
1. Use of RDF Reification
It is a clean solution, as we could add even more metadata to triples,
like which extractor they come from or a confidence value. The drawback
is that each annotated triple basically needs 4 extra triples plus the
metadata, which not only raises the total triple count, but also the
number of updates and queries needed to keep the data consistent.
(DBpedia could cross the billion-triple mark with this.)
For those of you not familiar with OWL 2 Axiom Annotations (similar to
RDF Reification), let me give a short explanation:
Assume you have a triple $s $p $o. To make an annotation about this
triple/axiom, you need to add the following (in Turtle syntax):
$a rdf:type owl:Axiom ;
   owl:subject $s ;
   owl:predicate $p ;
   owl:object $o .
The purpose of this construct is that we now have an identifier $a for
our triple. We can then annotate it, for instance:
$a extractedBy extractors:InfoboxExtractor ;
   extractedFromTemplate templates:city ;
   extractedOn "2009-10-25T04:00:00-05:00"^^xsd:dateTime .
(and maybe more meta information, e.g. a confidence value, or what led
to the modification, such as a page change or template change)
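Once the axiom identifier exists, the metadata can be queried with
plain SPARQL; a rough sketch, reusing the placeholder properties from
the example above:

# all triples produced by the InfoboxExtractor, with extraction date
SELECT ?s ?p ?o ?when
WHERE {
  ?a rdf:type owl:Axiom ;
     owl:subject ?s ;
     owl:predicate ?p ;
     owl:object ?o ;
     extractedBy extractors:InfoboxExtractor ;
     extractedOn ?when .
}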
An advantage of this approach is that we make the meta information
explicit and conform to OWL 2 and RDF. It could be queried and (without
too much effort) also made available via the Linked Data interface. It
would also allow us to create regular dumps from our live extraction.
The annotations can be used by the DBpedia live extraction as Sebastian
explained. A disadvantage is that we need a lot more triples compared to
the current situation. Assuming a full extraction would currently
require 300 million triples, storing additional annotations this way
would require 2.4 billion triples for DBpedia (roughly 8 triples per
annotated statement: the statement itself, the 4 reification triples
and about 3 annotation triples, so 300 million x 8 = 2.4 billion).
The specific questions we have are:
1.) Do you consider the increase in triple count problematic?
Since this is going to be V6-based, the size of DBpedia doesn't really
matter. For instance, we have 4.5+ billion triples (maybe 5+ now) on
http://lod.openlinksw.com. This is the kind of cluster setup we are
going to use for DBpedia realtime once it is ready.
2.) How are SPARQL SELECT queries (not involving annotations) affected?
Can we expect roughly the same performance (could be the case if
Virtuoso recognizes annotations), slightly worse performance, or much
worse performance?
I don't expect performance problems.
When we implement OWL 2 inference enhancements it will get better. But even
right now I don't see the SPARQL performance as an issue.
3.) SPARUL: Sebastian mentioned that 6 million triples will need to be
changed per day by the live extraction. Using annotations, this would
rise by a factor of three (estimated). Can approx. 20 million triple
updates per day be handled by the Virtuoso server(s) running DBpedia?
Since this is going to be loads and deletes, it shouldn't be too much
trouble, but we should test and see what happens; where issues arise we
can make specific tweaks etc.
Of course, we cannot expect any precise answers here, but educated
guesses are very welcome. :-)
Sure.
Kingsley
Kind regards,
Jens