Re: [Virtuoso-users] DBpedia-Live & Reification or Unique Triple IDs for adding metadata

Kingsley Idehen Sun, 31 May 2009 17:45:05 +0000

Sebastian Hellmann wrote:

Hello,
Note: If you want the short version, here is the single question I have: Can you keep metadata for triples in Virtuoso in an efficient way? And if so, where can I read about it? The concrete problem I have is described below.
The DBpedia live extraction will be working in a first version around next week, which means full extraction of the English Wikipedia. (We also have a group in Berlin that will run it for the German one soon.)
In this first version, we have the basics well covered, but we still have a quite complex engineering problem regarding triple provenance. Performance is the biggest issue here, as we have well over 140,000 article page updates a day with about 40+ generated triples per article page on average, resulting in about 6 million triples a day at minimum.
Now, this is all manageable, I think, as we have a diff, which reduces the number of statements deleted from and inserted into the Virtuoso endpoint.
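As a rough illustration of the diff step mentioned above, here is a minimal sketch. It assumes triples are modelled as plain (subject, predicate, object) tuples; the function name and the sample data are made up for the example and are not taken from the DBpedia extraction framework.

```python
def diff_triples(old, new):
    """Return (to_delete, to_insert) so that applying both turns `old` into `new`."""
    old_set, new_set = set(old), set(new)
    # Only statements that actually changed need to touch the endpoint.
    return old_set - new_set, new_set - old_set


# Illustrative data only: the triples of one article before and after an edit.
old = {
    ("dbpedia:Leipzig", "rdfs:label", '"Leipzig"'),
    ("dbpedia:Leipzig", "dbpedia:population", '"510000"'),
}
new = {
    ("dbpedia:Leipzig", "rdfs:label", '"Leipzig"'),
    ("dbpedia:Leipzig", "dbpedia:population", '"515000"'),
}

to_delete, to_insert = diff_triples(old, new)
```

The unchanged rdfs:label triple appears in neither set, so only one delete and one insert would be sent.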
All the extractors previously used provide distinct vocabularies, so it is clear which triple comes from which extractor, e.g. rdfs:label from the LabelExtractor or skos:Subject from the ArticleCategoryExtractor.
We found a way to annotate templates in Wikipedia that works in about the same way as the DBpedia Ontology and Mapping. If these mappings are changed, the changes should ideally show up in DBpedia immediately (or with a 24h overnight delay). So let's say we have a template for Infobox Person and it has a property placeOfBirth. If a user creates a mapping to dbpedia:birthplace, a process changes all properties placeOfBirth to birthplace for subjects that use the template (dbpedia:wikipageUsesTemplate). Problems arise if placeOfBirth is mapped to a property outside of the DBpedia namespace, e.g. foaf:birthPlace. There is currently no way to track the provenance of created triples, as we can no longer say whether foaf:birthPlace was created by the Infobox or the PersonData extractor [1].
The same applies if two template properties are mapped to the same property.
We have thought about different solutions, but we clearly need help regarding the features of Virtuoso. Our main aim is to add metadata to triples. A unique triple ID (RDF_QUINT instead of QUAD) is clearly not an option. Here are some other thoughts:


Virtuoso has:

<Graph-Grp><Graph><Subject><Predicate><Object>

The above is a quint.

You have IRIs for Graph Groups and Named Graphs [1].

DBpedia source file IRIs can be used as Named Graph IRIs.
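To make the quint idea concrete, here is a small sketch of the data shape only. It does not use or imitate any Virtuoso API; the class, the helper names, and the two source-file IRIs are hypothetical. The graph group IRI is the one given in the links below.

```python
from typing import NamedTuple


class Quint(NamedTuple):
    graph_group: str
    graph: str
    subject: str
    predicate: str
    obj: str


GRAPH_GROUP = "http://dbpedia.org"

# Hypothetical source-file IRIs, used here as Named Graph IRIs.
INFOBOX = "http://example.org/dbpedia/infobox.nt"
PERSONDATA = "http://example.org/dbpedia/persondata.nt"


def quints_from_source(source_file_iri, triples):
    """Tag every triple from one source file with its graph and graph group."""
    return [Quint(GRAPH_GROUP, source_file_iri, s, p, o) for s, p, o in triples]


def graphs_containing(quints, predicate):
    """Return the named graphs (i.e. source files) that hold a given predicate."""
    return {q.graph for q in quints if q.predicate == predicate}


quints = quints_from_source(
    INFOBOX, [("dbpedia:Some_Person", "foaf:birthPlace", "dbpedia:Leipzig")]
) + quints_from_source(
    PERSONDATA, [("dbpedia:Some_Person", "foaf:birthPlace", "dbpedia:Leipzig")]
)

sources = graphs_containing(quints, "foaf:birthPlace")
```

Because the graph position records the source file, the same foaf:birthPlace triple stays attributable to each source that produced it.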

1. Use of RDF Reification
It is a clean solution, as we could add even more metadata to triples, like which extractor they come from or a confidence value. The drawback is that each statement basically needs 4 extra triples plus the metadata, which not only raises the total triple count, but also the number of updates and queries needed to keep updates consistent. (DBpedia could break the billion-triple border with this.)
2. We could have an extra graph for each template, which stores all triples created and mapped by that template. This generates about 50% more triples or less, but also about 200,000 graphs (one graph for each template). This would solve the provenance issue, because it would be clear whether foaf:birthPlace was extracted by the Infobox extractor with a mapping or by the PersonDataExtractor.
3.a) Reschedule and parse anew.
It would be easy to just parse all affected pages once more. But I doubt it would scale. If a template annotation changes, all articles that use that template need to be reparsed. The popular templates are used by about 50,000-80,000 articles, so a small change here would add half a day's worth of parsing to the extraction queue.
3.b) Optimize 3.a)
Of course there could be checks and measures to decrease the number of reparsed pages, marking article pages with conflicts as dirty, etc.
3.c) Ignore changes
One option would be to do nothing if an annotation changes and just wait until the articles themselves are changed and parsed anew. This means that the template annotation would not have an immediate effect, and it could take a couple of weeks until all changes are included in DBpedia.
4. Space-efficient storage with RDF Views
There could be a db table with G,S,P,O,Pm,Om. (Pm = meta property, Om = meta object; Pm could be extractor and Om extractorID). This table would be the same solution as 1, but in a more efficient form, resulting in one db row instead of 5 triples per annotated statement. RDF Views could then allow us to query it with SPARQL.
5. Please tell me if you have more ideas. What is possible with Virtuoso?
What would be the best way?
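To compare the storage cost of options 1 and 4, here is a minimal sketch. It assumes the standard RDF reification vocabulary for option 1 and uses an in-memory SQLite table as a stand-in for the G,S,P,O,Pm,Om table of option 4. The table name, the ex:extractor meta property, and the sample triple are hypothetical, and the RDF Views mapping itself is not shown.

```python
import sqlite3

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(stmt_id, s, p, o, meta):
    """Option 1: 4 reification triples plus one triple per (Pm, Om) pair."""
    triples = [
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", s),
        (stmt_id, RDF + "predicate", p),
        (stmt_id, RDF + "object", o),
    ]
    triples += [(stmt_id, pm, om) for pm, om in meta]
    return triples


# Illustrative statement and metadata (names are assumptions).
g = "http://dbpedia.org"
s, p, o = "dbpedia:Some_Person", "foaf:birthPlace", "dbpedia:Leipzig"
meta = [("ex:extractor", "PersonDataExtractor")]

reified = reify("_:stmt1", s, p, o, meta)

# Option 4: the same information as a single row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triple_meta (g TEXT, s TEXT, p TEXT, o TEXT, pm TEXT, om TEXT)"
)
conn.execute(
    "INSERT INTO triple_meta VALUES (?, ?, ?, ?, ?, ?)",
    (g, s, p, o, meta[0][0], meta[0][1]),
)
rows = conn.execute(
    "SELECT om FROM triple_meta WHERE s = ? AND p = ?", (s, p)
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM triple_meta").fetchone()[0]
```

With one metadata pair, the reified form needs 5 triples in addition to the original statement, while the table form needs a single row.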

If you are set with the extractors, we can quite easily derive a DBpedia Cartridge from our Wikipedia Cartridge. The only differences will be:


1. Graph Grp. IRI will be used for DBpedia
2. Source File IRI for Named Graph IRIs

3. SPARUL (post-processing of the OAI-PMH feeds) will apply to the respective Named Graph IRIs

4. Make a schedule in Virtuoso for item 3.

Scalability is not an issue, especially with the Cluster Edition, which can spread I/O over many Cluster Nodes. We can even use Node Roles to control what's in the foreground and what's in the background, etc.


All I need is confirmation that the extraction for realtime DBpedia is done.

Links:

1. http://docs.openlinksw.com/virtuoso/rdfgraphsecurity.html -- Graph Security
2. http://dbpedia2.openlinksw.com:8899/void/Dataset -- shows you the quints in action, where each DBpedia source file is a Graph IRI within the Graph Group IRI <http://dbpedia.org>.




Kingsley

Many thanks,
Sebastian, AKSW
http://bis.informatik.uni-leipzig.de/SebastianHellmann


[1] http://en.wikipedia.org/wiki/Wikipedia:Persondata


--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen

President & CEO
OpenLink Software     Web: http://www.openlinksw.com
