Sebastian Hellmann wrote:
Hello,

Note: if you want the short version, here is the single question I have: can you keep metadata for triples in Virtuoso in an efficient way? And if so, where can I read about it? The concrete problem I have is described below:

A first version of the DBpedia live extraction should be working around next week, which means full extraction of the English Wikipedia. (We also have a group in Berlin that will soon run it for the German one.)

In this first version, we have the basics well covered, but we still have a fairly complex engineering problem regarding triple provenance. Performance is the biggest issue here, as we have well over 140,000 article page updates a day with about 40+ generated triples per article page on average, resulting in a minimum of roughly 6 million triples a day.

Now, this is all manageable, I think, as we have a diff which reduces the number of statements deleted from and inserted into the Virtuoso endpoint.
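
As a rough illustration of the diff step, the following Python sketch compares the triples previously extracted from one article with the newly extracted ones and derives the statements to delete and insert. The function name diff_triples and the sample triples are hypothetical; the actual live extraction code may work differently.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def diff_triples(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (to_delete, to_insert) so that applying both turns `old` into `new`."""
    to_delete = old - new   # statements that disappeared from the page
    to_insert = new - old   # statements that are new on the page
    return to_delete, to_insert


# Hypothetical sample data: only the population value changed.
old = {
    ("dbpedia:Berlin", "rdfs:label", '"Berlin"@en'),
    ("dbpedia:Berlin", "dbpedia:population", '"3400000"'),
}
new = {
    ("dbpedia:Berlin", "rdfs:label", '"Berlin"@en'),
    ("dbpedia:Berlin", "dbpedia:population", '"3431700"'),
}
to_delete, to_insert = diff_triples(old, new)
```

Only the changed statement is touched; the unchanged rdfs:label triple causes no update at the endpoint.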

All the extractors used so far produce disjoint vocabularies, so it is clear which triple comes from which extractor, e.g. rdfs:label from the LabelExtractor or skos:Subject from the ArticleCategoryExtractor.

We found a way to annotate templates in Wikipedia which works much the same way as the DBpedia Ontology and Mapping. If these mappings are changed, the changes should ideally show up in DBpedia immediately (or with a 24h overnight delay). So let's say we have a template for Infobox Person and it has a property placeOfBirth. If a user creates a mapping to dbpedia:birthplace, a process changes all placeOfBirth properties to birthplace for subjects that use the template (dbpedia:wikipageUsesTemplate). Problems arise if placeOfBirth is mapped to a property outside of the dbpedia namespace, e.g. foaf:birthPlace. There is currently no way to track the provenance of created triples, as we can no longer tell whether foaf:birthPlace was created by the Infobox or the PersonData extractor [1].
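
To make the mapping process concrete, here is a minimal in-memory sketch in Python. The function apply_mapping, the template IRI and the sample triples are made up for illustration; only dbpedia:wikipageUsesTemplate, placeOfBirth and birthplace are taken from the description above.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
USES_TEMPLATE = "dbpedia:wikipageUsesTemplate"


def apply_mapping(triples: Set[Triple], template: str, old_prop: str, new_prop: str) -> Set[Triple]:
    """Rename old_prop to new_prop, but only for subjects that use the given template."""
    subjects = {s for s, p, o in triples if p == USES_TEMPLATE and o == template}
    return {
        (s, new_prop if (s in subjects and p == old_prop) else p, o)
        for s, p, o in triples
    }


# Hypothetical sample data.
triples = {
    ("dbpedia:Albert_Einstein", USES_TEMPLATE, "template:Infobox_Person"),
    ("dbpedia:Albert_Einstein", "dbpedia:placeOfBirth", "dbpedia:Ulm"),
    # This subject does not use the template, so it must stay untouched.
    ("dbpedia:Some_Other_Page", "dbpedia:placeOfBirth", "dbpedia:Ulm"),
}
mapped = apply_mapping(triples, "template:Infobox_Person",
                       "dbpedia:placeOfBirth", "dbpedia:birthplace")
```

Once the target property is shared with another extractor (as with foaf:birthPlace), the triples themselves no longer record which rule produced them, which is the provenance gap described above.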

The same applies if two template properties are mapped to the same property.

We have thought about different solutions, but we clearly need help regarding the features of Virtuoso. Our main aim is to add metadata to triples. A unique triple ID (RDF_QUINT instead of QUAD) is clearly not an option. Here are some other thoughts:

Virtuoso has:

<Graph-Grp><Graph><Subject><Predicate><Object>

The above is a quint.

You have IRIs for Graph Groups and Named Graphs [1].

DBpedia source file IRIs can be used as Named Graph IRIs.

1. Use of RDF Reification
It is a clean solution, as we could add even more metadata to triples, like which extractor they come from or a confidence value. The drawback is that each statement basically needs 4 extra triples plus the metadata, which raises not only the total triple count, but also the number of updates and queries needed to keep updates consistent. (DBpedia could break the billion-triple mark with this.)
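
The overhead can be seen in a small Python sketch that expands one statement into its reified form using the standard rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms. The function reify, the statement IRI and the ex:extractedBy metadata property are hypothetical names used only for illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def reify(statement_iri: str, triple: Triple, metadata: List[Tuple[str, str]]) -> List[Triple]:
    """Return the 4 reification triples for `triple` plus one triple per metadata pair."""
    s, p, o = triple
    reified = [
        (statement_iri, "rdf:type", "rdf:Statement"),
        (statement_iri, "rdf:subject", s),
        (statement_iri, "rdf:predicate", p),
        (statement_iri, "rdf:object", o),
    ]
    # Each (meta property, meta object) pair adds one more triple.
    reified += [(statement_iri, mp, mo) for mp, mo in metadata]
    return reified


# Hypothetical example: one extracted statement with one piece of provenance.
statement = ("dbpedia:Albert_Einstein", "foaf:birthPlace", "dbpedia:Ulm")
extra = reify("ex:stmt1", statement, [("ex:extractedBy", "ex:InfoboxExtractor")])
```

Here a single extracted statement costs five additional triples, on top of the statement itself.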

2. We could have an extra graph for each template, in which all triples created and mapped by that template are stored. This generates about 50% more triples or less, but also about 200,000 graphs (one graph for each template). This would solve the provenance issue, because it would be clear whether foaf:birthPlace was extracted by the Infobox extractor with a mapping or by the PersonDataExtractor.
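
A toy in-memory model of this idea, written in Python, is sketched below. It is not Virtuoso code; the QuadStore class and the graph IRIs under http://example.org/ are assumptions made for the example. It only shows that the graph a triple lives in can serve as its provenance, and that dropping a template's graph removes exactly that template's triples.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


class QuadStore:
    """Minimal named-graph store: graph IRI -> set of triples."""

    def __init__(self) -> None:
        self.graphs: Dict[str, Set[Triple]] = defaultdict(set)

    def add(self, graph: str, triple: Triple) -> None:
        self.graphs[graph].add(triple)

    def graphs_containing(self, triple: Triple) -> Set[str]:
        """Provenance lookup: which graphs assert this triple?"""
        return {g for g, triples in self.graphs.items() if triple in triples}

    def drop_graph(self, graph: str) -> None:
        """Remove everything a template (or extractor) contributed."""
        self.graphs.pop(graph, None)


# Hypothetical graph IRIs, one per template or extractor.
TEMPLATE_GRAPH = "http://example.org/graph/template/Infobox_Person"
PERSONDATA_GRAPH = "http://example.org/graph/extractor/PersonData"

store = QuadStore()
birth = ("dbpedia:Albert_Einstein", "foaf:birthPlace", "dbpedia:Ulm")
store.add(TEMPLATE_GRAPH, birth)
store.add(PERSONDATA_GRAPH, birth)

before = store.graphs_containing(birth)
store.drop_graph(TEMPLATE_GRAPH)      # e.g. the template mapping changed
after = store.graphs_containing(birth)
```

After the template graph is dropped, the statement is still asserted by the PersonData graph, so nothing extracted by the other extractor is lost.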

3.a) Reschedule and parse anew.
It would be easy to just parse all affected pages once more, but I doubt it would scale. If a template annotation changes, all articles that use that template need to be reparsed. The popular templates are used by about 50,000-80,000 articles, so a small change here would add half a day's worth of parsing to the extraction queue.
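
A quick back-of-the-envelope check of these numbers in Python. It assumes that the extraction queue processes about as many pages per day as there are daily page updates (140,000); the real throughput is not stated, so treat the result as a rough estimate only.

```python
PAGE_UPDATES_PER_DAY = 140_000   # from the figures given earlier in this mail
TRIPLES_PER_PAGE = 40            # average, also from above


def extra_days(pages_to_reparse: int, pages_per_day: int = PAGE_UPDATES_PER_DAY) -> float:
    """Days of extra queue work, assuming throughput equals the daily update rate."""
    return pages_to_reparse / pages_per_day


triples_per_day = PAGE_UPDATES_PER_DAY * TRIPLES_PER_PAGE
low = extra_days(50_000)    # popular template, lower bound of usage
high = extra_days(80_000)   # popular template, upper bound of usage

print(f"{triples_per_day:,} triples/day, {low:.2f} to {high:.2f} extra days per template change")
```

Under that assumption, one change to a popular template costs roughly a third to a bit more than half of a normal day's parsing.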

3.b) Optimize 3.a)
Of course there could be checks and measures to decrease the number of reparsed pages, such as marking article pages with conflicts as dirty, etc.

3.c) Ignore changes
One option would be to do nothing when an annotation changes and just wait until the articles themselves are changed and reparsed. This means that the template annotation wouldn't have an immediate effect, and it could take a couple of weeks until all changes are included in DBpedia.

4. Space-efficient storage with RDF-Views
There could be a db table with G,S,P,O,Pm,Om (Pm = meta property, Om = meta object; Pm could be extractor and Om extractorID). This table would be the same solution as 1, but more efficient, resulting in one db row instead of 5 triples per statement. RDF-Views could then allow querying it with SPARQL.
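
The table layout can be sketched with Python's built-in sqlite3 module. SQLite merely stands in for a relational table here; the table name triple_meta, the ex: names and the sample rows are hypothetical, and the mapping to SPARQL via RDF-Views is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per statement plus one piece of metadata (Pm, Om).
conn.execute(
    "CREATE TABLE triple_meta (g TEXT, s TEXT, p TEXT, o TEXT, pm TEXT, om TEXT)"
)
rows = [
    ("http://dbpedia.org", "dbpedia:Albert_Einstein", "foaf:birthPlace",
     "dbpedia:Ulm", "ex:extractor", "ex:InfoboxExtractor"),
    ("http://dbpedia.org", "dbpedia:Albert_Einstein", "foaf:birthPlace",
     "dbpedia:Ulm", "ex:extractor", "ex:PersonDataExtractor"),
]
conn.executemany("INSERT INTO triple_meta VALUES (?, ?, ?, ?, ?, ?)", rows)


def extractors_for(conn: sqlite3.Connection, s: str, p: str, o: str) -> list:
    """Return the extractor IDs recorded for one statement, sorted by name."""
    cur = conn.execute(
        "SELECT om FROM triple_meta "
        "WHERE s = ? AND p = ? AND o = ? AND pm = ? ORDER BY om",
        (s, p, o, "ex:extractor"),
    )
    return [row[0] for row in cur.fetchall()]


found = extractors_for(conn, "dbpedia:Albert_Einstein", "foaf:birthPlace", "dbpedia:Ulm")
```

Each provenance fact is a single row here, whereas the reified form in option 1 needs five triples for the same information.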

5. Please tell me if you have more ideas. What is possible with Virtuoso, and what would be the best way?


If you are set with the extractors, we can quite easily derive a DBpedia Cartridge from our Wikipedia Cartridge. The only differences will be:

1. Graph Grp. IRI will be used for DBpedia
2. Source File IRI for Named Graph IRIs
3. SPARUL (post-processing of the OAI-PMH feeds) will apply to the respective Named Graph IRIs
4. Make a schedule in Virtuoso for item 3.

Scalability is not an issue, especially with the Cluster Edition, which can spread I/O over many Cluster Nodes. We can even use Node Roles to control what's in the foreground and what's in the background, etc.

All I need is confirmation that the extraction for realtime DBpedia is done.

Links:

1. http://docs.openlinksw.com/virtuoso/rdfgraphsecurity.html -- Graph Security
2. http://dbpedia2.openlinksw.com:8899/void/Dataset -- shows you the quints in action, where each DBpedia source file is a Graph IRI within the Graph Group IRI <http://dbpedia.org>.



Kingsley
Many thanks,
Sebastian, AKSW
http://bis.informatik.uni-leipzig.de/SebastianHellmann


[1] http://en.wikipedia.org/wiki/Wikipedia:Persondata





--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO OpenLink Software Web: http://www.openlinksw.com




