Hello,

Sebastian Hellmann wrote:
> Hello,
>
> [...]
>
> 1. Use of RDF Reification
> It is a clean solution, as we could add even more metadata to triples, such as which extractor they come from or a confidence value. The drawback is that each triple needs 4 extra triples plus the metadata, which raises not only the total triple count but also the number of updates and queries needed to keep the data consistent. (DBpedia could break the billion-triple mark with this.)

For those of you not familiar with OWL 2 Axiom Annotations (similar to RDF Reification), let me give a short explanation:

Assume you have a triple $s $p $o. To make an annotation about this triple/axiom, you need to add the following (in Turtle syntax):

$a rdf:type owl:Axiom;
   owl:annotatedSource $s;
   owl:annotatedProperty $p;
   owl:annotatedTarget $o .

The purpose of this construct is that we now have an identifier $a for our triple. We can then annotate it, for instance:

$a extractedBy extractors:InfoboxExtractor;
   extractedFromTemplate templates:city;
   extractedOn "2009-10-25T04:00:00-05:00"^^xsd:dateTime .
   (possibly more meta information, e.g. a confidence value, or what led
   to the modification, such as a page change or a template change)
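
To make the construct above concrete, here is a minimal Python sketch that generates the annotation triples for one statement. The helper names, the blank node _:a1, and the ex: and meta: prefixes are hypothetical and only serve the illustration; the property names are the ones from the final OWL 2 Recommendation (owl:annotatedSource, owl:annotatedProperty, owl:annotatedTarget). It is not the code of the DBpedia extraction framework.

```python
# Hypothetical helper names and prefixes; a sketch, not DBpedia's extractor code.

def annotation_triples(axiom_id, s, p, o, metadata):
    """Return the triples that name the statement (s, p, o) as axiom_id
    and attach the given metadata (a dict of predicate -> object)."""
    triples = [
        (axiom_id, "rdf:type", "owl:Axiom"),
        (axiom_id, "owl:annotatedSource", s),
        (axiom_id, "owl:annotatedProperty", p),
        (axiom_id, "owl:annotatedTarget", o),
    ]
    # One extra triple per piece of metadata.
    triples.extend((axiom_id, pred, obj) for pred, obj in metadata.items())
    return triples


def to_turtle(triples):
    """Serialize triples one statement per line (prefix declarations omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


example = annotation_triples(
    "_:a1", "ex:s", "ex:p", "ex:o",
    {
        "meta:extractedBy": "extractors:InfoboxExtractor",
        "meta:extractedFromTemplate": "templates:city",
        "meta:extractedOn": '"2009-10-25T04:00:00-05:00"^^xsd:dateTime',
    },
)
print(to_turtle(example))
print(len(example), "triples in addition to the original statement")  # 7
```

With the three metadata properties from the example, every annotated statement costs seven additional triples: four for the axiom itself and three for the metadata.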

An advantage of this approach is that the meta information becomes explicit and conforms to OWL 2 and RDF. It could be queried and, without too much effort, also made available via the Linked Data interface. It would also allow us to create regular dumps from our live extraction, and the DBpedia live extraction can use the annotations as Sebastian explained.

A disadvantage is that we need many more triples than we do now. Assuming a full extraction currently requires 300 million triples, storing the additional annotations this way would require 2.4 billion triples for DBpedia.
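
One way to arrive at the 2.4 billion figure is sketched below. The breakdown is an assumption that happens to reproduce the number, not something stated explicitly: each original triple is kept, gets four axiom triples, and carries the three metadata triples from the example.

```python
# Assumed breakdown per extracted statement (not stated explicitly in the text).
BASE_TRIPLES = 300_000_000   # full extraction without annotations
ORIGINAL = 1                 # the statement itself is kept
AXIOM_TRIPLES = 4            # rdf:type owl:Axiom plus source, property, target
METADATA_TRIPLES = 3         # extractedBy, extractedFromTemplate, extractedOn

per_statement = ORIGINAL + AXIOM_TRIPLES + METADATA_TRIPLES
total = BASE_TRIPLES * per_statement

print(per_statement)   # 8
print(f"{total:,}")    # 2,400,000,000
```

Every additional metadata property (e.g. a confidence value) would add another 300 million triples under the same assumption.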

Our specific questions are:

1.) Do you consider the increase in triple count problematic?

2.) How are SPARQL SELECT queries (not involving annotations) affected? Can we expect roughly the same performance (could be the case if Virtuoso recognizes annotations), slightly worse performance, or much worse performance?

3.) SPARUL: Sebastian mentioned that the live extraction will need to change 6 million triples per day. Using annotations, this would rise by an estimated factor of three. Can the Virtuoso server(s) running DBpedia handle approx. 20 million triple updates per day?
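
To put the update volume in question 3 into perspective, the following sketch converts the daily figures into a sustained rate. It assumes updates are spread evenly over 24 hours, which is a simplification; real load from page and template changes is likely to be bursty.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def updates_per_second(updates_per_day):
    """Average rate, assuming an even spread over the whole day."""
    return updates_per_day / SECONDS_PER_DAY


plain = 6_000_000        # triple changes per day without annotations
annotated = plain * 3    # 18 million, rounded up to about 20 million in the question

for label, per_day in (("plain", plain),
                       ("annotated", annotated),
                       ("rounded estimate", 20_000_000)):
    print(f"{label}: {updates_per_second(per_day):.0f} updates/s on average")
```

Under that assumption, 20 million updates per day correspond to roughly 230 triple updates per second on average.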

Of course, we cannot expect any precise answers here, but educated guesses are very welcome. :-)

Kind regards,

Jens

--
Dipl. Inf. Jens Lehmann
Department of Computer Science, University of Leipzig
Homepage: http://www.jens-lehmann.org
GPG Key: http://jens-lehmann.org/jens_lehmann.asc
