Sebastian Hellmann wrote:
Ok, this will work.
If all information about a resource is held in one named graph, then a
straightforward delete/insert process will remove the following sets
from DBpedia: the owl:sameAs links, images, YAGO, UMBEL, OpenCyc,
everything in the DBpedia ontology namespace (as these will be
replaced by Wikipedia template annotations), and any other information
not coming from the extractors.
One solution could be either to move this data to a different named
graph or to do a diff.
The diff would be:
DELETE FROM :London
{ ?s ?p ?o }
WHERE {
  GRAPH :London {
    ?s ?p ?o .
    OPTIONAL {
      ?a rdf:type owl:Axiom .
      ?a owl:subject ?s .
      ?a owl:predicate ?p .
      ?a owl:object ?o .
      ?a ?p2 ?o2 .
      FILTER ( ... ) .  # condition selecting everything that should stay
    }
    # delete triples with no matching axiom annotation (the axiom and
    # annotation triples themselves would also need to be excluded here)
    FILTER ( !bound(?a) ) .
  }
}
The basic idea should be clear, though: delete everything that doesn't
match the pattern.
Actually, moving anything that should stay to a different graph would
be much easier now.
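Just to make the "move" option concrete, a rough sketch in the same
old SPARUL style as above (the target graph name :London_static is
made up here, and owl:sameAs stands in for any of the static link
sets):

INSERT INTO :London_static
{ ?s owl:sameAs ?o }
WHERE { GRAPH :London { ?s owl:sameAs ?o } }

DELETE FROM :London
{ ?s owl:sameAs ?o }
WHERE { GRAPH :London { ?s owl:sameAs ?o } }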
@Kingsley: What do you think? I can prepare initial metadata for the
loaded datasets tomorrow. What would be your decision? Shall we use
the 2.4 billion triples, or should we have different named graphs for
each extractor, which produces no overhead but is clearly not an
optimal solution?
Let's try the 2.4 billion triples.
The OWL axiom annotations are far more powerful.
Also, what is happening to the data that will not be refreshed, like
YAGO and owl:sameAs links? Are we moving it to separate graphs, or
shall I implement the new diff?
They will be in different graphs. You can get a feel for the new graph
partitioning at: http://dbpedia2.openlinksw.com:8895/void/Dataset .
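For a rough idea of what such a partitioning can look like in voiD,
here is a sketch (the prefix and subset names below are invented; the
real descriptions are at the URL above):

@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <http://dbpedia.org/void/> .

:DBpedia a void:Dataset ;
    void:sparqlEndpoint <http://dbpedia.org/sparql> ;
    void:subset :DBpedia_infobox , :DBpedia_sameAs , :DBpedia_yago .

:DBpedia_infobox a void:Dataset .
:DBpedia_sameAs a void:Dataset .
:DBpedia_yago a void:Dataset .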
Kingsley
Regards, Sebastian
Kingsley Idehen wrote:
Jens Lehmann wrote:
Hello,
Sebastian Hellmann wrote:
Hello,
[...]
1. Use of RDF Reification
It is a clean solution, as we could add even more metadata to
triples, like which extractor they come from or a confidence value.
The drawback is that each annotated triple basically needs 4 extra
triples plus the metadata, which raises not only the total triple
count but also the number of updates and queries needed to keep
updates consistent.
(DBpedia could cross the one-billion-triple mark with this.)
For those of you not familiar with OWL 2 Axiom Annotations (similar
to RDF Reification), let me give a short explanation:
Assume you have a triple $s $p $o. To make an annotation about this
triple/axiom, you need to add the following (in Turtle syntax):
$a rdf:type owl:Axiom ;
   owl:subject $s ;
   owl:predicate $p ;
   owl:object $o .
The purpose of this construct is that we now have an identifier $a
for our triple. We can then annotate it, for instance:
$a extractedBy extractors:InfoboxExtractor ;
   extractedFromTemplate templates:city ;
   extractedOn "2009-10-25T04:00:00-05:00"^^xsd:dateTime .
(maybe more meta information, e.g. confidence value, what led to the
modification, e.g. page change, template change)
An advantage of this approach is that we make the meta information
explicit and conform to OWL 2 and RDF. It could be queried and
(without too much effort) also made available via the Linked Data
interface. It would also allow us to create regular dumps from our
live extraction. The annotations can be used by the DBpedia live
extraction as Sebastian explained. A disadvantage is that we need a
lot more triples compared to the current situation: each annotated
triple needs the four axiom triples plus roughly three annotation
triples, i.e. about eight triples per original statement. Assuming a
full extraction would currently require 300 million triples, storing
additional annotations this way would require 2.4 billion triples
for DBpedia.
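Just as a sketch of how such annotations could then be queried (prefix
declarations omitted as in the example above; the resource and
property are placeholders, and the annotation properties are the ones
from the example):

SELECT ?extractor ?template ?date
WHERE {
  ?a rdf:type owl:Axiom ;
     owl:subject :London ;
     owl:predicate :populationTotal ;
     owl:object ?value ;
     extractedBy ?extractor ;
     extractedFromTemplate ?template ;
     extractedOn ?date .
}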
The specific questions we have are:
1.) Do you consider the increase in triple count problematic?
Since this is going to be Virtuoso 6 (V6) based, the size of DBpedia
doesn't really matter. For instance, we have 4.5+ billion triples
(maybe 5+ billion now) on: http://lod.openlinksw.com. This is the kind
of cluster setup we are going to use for DBpedia realtime once ready.
2.) How are SPARQL SELECT queries (not involving annotations)
affected? Can we expect roughly the same performance (could be the
case if Virtuoso recognizes annotations), slightly worse
performance, or much worse performance?
I don't expect performance problems.
When we implement OWL 2 inference enhancements it will get better. But
even right now I don't see SPARQL performance as an issue.
3.) SPARUL: Sebastian mentioned that 6 million triples will need to
be changed per day by the live extraction. Using annotations, this
would rise by a factor of three (estimated). Can approx. 20 million
triple updates per day be handled by the Virtuoso server(s) running
DBpedia?
Since this is going to be loads and deletes it shouldn't be too much
trouble, but we should test and see what happens; where issues arise
we can make specific tweaks etc.
Of course, we cannot expect any precise answers here, but educated
guesses are very welcome. :-)
Sure.
Kingsley
Kind regards,
Jens
--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com