Sebastian Hellmann wrote:
Hello,
Note: If you want the short version, here is the single question I have:
Can you keep metadata for triples in Virtuoso in an efficient way? And
if so, where can I read about it? The concrete problem I have is this:
The DBpedia live extraction will be working around next week in a first
version, which means full extraction of the English Wikipedia. (We also
have a group in Berlin that will run it for the German one soon.)
In this first version, we have the basics well covered, but we still
have a quite complex engineering problem regarding triple provenance.
Performance is the biggest issue here, as we have well over 140,000
article page updates a day with 40+ generated triples per article
page on average, resulting in at least 6 million triples a day.
Now, this is all manageable, I think, as we have a diff, which reduces
the number of statements deleted from/inserted into the Virtuoso endpoint.
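
To illustrate the diff idea, here is a minimal sketch in Python. It
assumes triples are plain (subject, predicate, object) tuples; the
function name diff_triples and the sample values are invented for
illustration and are not taken from the actual extraction framework.

```python
# Hypothetical sketch: a triple is a plain (subject, predicate, object) tuple.
def diff_triples(old, new):
    """Return (to_delete, to_insert) so only changed statements reach the store."""
    old_set, new_set = set(old), set(new)
    return old_set - new_set, new_set - old_set


# Invented sample data: one article before and after an edit.
old = [
    ("dbpedia:Leipzig", "rdfs:label", "Leipzig"),
    ("dbpedia:Leipzig", "dbpedia:population", "510000"),
]
new = [
    ("dbpedia:Leipzig", "rdfs:label", "Leipzig"),
    ("dbpedia:Leipzig", "dbpedia:population", "515000"),
]

to_delete, to_insert = diff_triples(old, new)
```

Only the changed population statement is deleted and re-inserted here;
the unchanged label never touches the endpoint.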
All the extractors used previously provide distinct vocabularies, so it
is clear which triple comes from which extractor, e.g. rdfs:label from
the LabelExtractor or skos:Subject from the ArticleCategoryExtractor.
We found a way to annotate templates in Wikipedia that works in about
the same way as the DBpedia Ontology and Mapping. If these mappings are
changed, the changes should ideally show up in DBpedia immediately (or
with a 24h overnight delay).
So let's say we have a template for Infobox Person and it has a property
placeOfBirth. If a user creates a mapping to dbpedia:birthplace, a
process changes all properties placeOfBirth to birthplace for subjects
that use the template (dbpedia:wikipageUsesTemplate). Problems arise if
placeOfBirth is mapped to a property outside of the dbpedia
namespace, e.g. foaf:birthPlace. There is currently no way to track the
provenance of created triples, as we can no longer tell whether
foaf:birthPlace was created by the Infobox or the PersonData extractor [1].
The same problem occurs if two template properties are mapped to the
same property.
We have thought about different solutions, but we clearly need help
regarding the features of Virtuoso. Our main aim is to add metadata to
triples. A unique triple ID (RDF_QUINT instead of QUAD) is clearly not
an option. Here are some other thoughts:
Virtuoso has:
<Graph-Grp><Graph><Subject><Predicate><Object>
The above is a quint.
You have IRIs for Graph Groups and Named Graphs [1].
DBpedia source file IRIs can be used as Named Graph IRIs.
1. Use of RDF Reification
It is a clean solution, as we could add even more metadata to triples,
like which extractor they come from or a confidence value. The drawback
is that each triple basically needs 4 extra triples + the metadata, which
not only raises the total triple count, but also the number of updates and
queries needed to keep updates consistent. (DBpedia could break the
billion-triple mark with this.)
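
The overhead of reification can be sketched as follows. This is a
plain-Python illustration, not Virtuoso code: the rdf:Statement,
rdf:subject, rdf:predicate and rdf:object terms are the standard RDF
reification vocabulary, while the function name reify, the statement
IRI and the meta:extractor property are hypothetical.

```python
def reify(triple, statement_id, metadata):
    """Return the triples needed to describe `triple` via RDF reification.

    `metadata` is a list of (meta_property, meta_object) pairs.
    """
    s, p, o = triple
    triples = [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
    ]
    # Each piece of metadata adds one more triple about the statement.
    triples += [(statement_id, mp, mo) for mp, mo in metadata]
    return triples


# Hypothetical identifiers, for illustration only.
extra = reify(
    ("dbpedia:Some_Person", "foaf:birthPlace", "dbpedia:Leipzig"),
    "meta:stmt1",
    [("meta:extractor", "InfoboxExtractor")],
)
```

With a single metadata value, one extracted triple already costs five
additional triples.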
2. We could have an extra graph for each template, in which all triples
created and mapped by that template are stored. This generates about 50%
more triples, or less, but also about 200,000 graphs (one graph for each
template). This would solve the provenance issue, because it would be
clear whether foaf:birthPlace was extracted by the Infobox extractor with
a mapping or by the PersonDataExtractor.
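
A rough sketch of how per-template graphs would answer the provenance
question, using in-memory quads in Python rather than Virtuoso. The
graph IRIs and the helper names index_by_triple and drop_graph are
assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical graph IRIs: one per template, one for the PersonData extractor.
INFOBOX_GRAPH = "http://dbpedia.org/template/Infobox_Person"
PERSONDATA_GRAPH = "http://dbpedia.org/extractor/PersonData"

# Quads are (graph, subject, predicate, object); the same triple appears twice.
quads = [
    (INFOBOX_GRAPH, "dbpedia:Some_Person", "foaf:birthPlace", "dbpedia:Leipzig"),
    (PERSONDATA_GRAPH, "dbpedia:Some_Person", "foaf:birthPlace", "dbpedia:Leipzig"),
]


def index_by_triple(quads):
    """Map each (s, p, o) to the set of graphs that contain it."""
    index = defaultdict(set)
    for g, s, p, o in quads:
        index[(s, p, o)].add(g)
    return index


def drop_graph(quads, graph):
    """Remove everything one template produced, e.g. after its mapping changed."""
    return [q for q in quads if q[0] != graph]


triple = ("dbpedia:Some_Person", "foaf:birthPlace", "dbpedia:Leipzig")
sources = index_by_triple(quads)[triple]
remaining = drop_graph(quads, INFOBOX_GRAPH)
```

Dropping the template's graph leaves the PersonData copy of the triple
intact, which is exactly what cannot be decided without provenance.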
3.a) Reschedule and parse anew.
It would be easy to just parse all affected pages once more. But I doubt
it would scale. If a template annotation changes, all articles that use
that template need to be reparsed. The popular templates are used by
about 50,000-80,000 articles, so a small change here
would add half a day's worth of parsing to the extraction queue.
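
The "half a day" estimate can be checked with a quick calculation. It
assumes the extraction throughput roughly equals the daily update rate
of 140,000 pages mentioned earlier; that equivalence is an assumption
of this sketch, not a measured figure.

```python
UPDATES_PER_DAY = 140_000   # article page updates per day (from the text)
TRIPLES_PER_PAGE = 40       # lower bound of "40+" triples per page

# 5.6 million with exactly 40 triples per page; "40+" pushes this towards 6 million.
daily_triples = UPDATES_PER_DAY * TRIPLES_PER_PAGE


def reparse_days(pages, pages_per_day=UPDATES_PER_DAY):
    """Days of extra queue time, assuming throughput equals the daily update rate."""
    return pages / pages_per_day


low = reparse_days(50_000)    # roughly 0.36 days
high = reparse_days(80_000)   # roughly 0.57 days
```

So a single change to a popular template would add between about a
third and more than half of a normal day's load.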
3.b) Optimize 3.a)
Of course there could be checks and measures to decrease the number of
reparsed pages, marking article pages with conflicts as dirty, etc.
3.c) Ignore changes
One option would be to do nothing if an annotation changes and
just wait until the articles themselves are changed and parsed again.
This means that the template annotation wouldn't have an immediate
effect, and it could take a couple of weeks until all changes are
included in DBpedia.
4. Space-efficient storage with RDF-Views
There could be a db table with G,S,P,O,Pm,Om. (Pm = meta property, Om =
meta object; Pm could be extractor and Om the extractorID.) This table
would be the same solution as 1, but in a more efficient form, resulting
in one db row instead of 5 triples per annotated triple. RDF-Views could
then allow querying it with SPARQL.
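
The table layout could look roughly like this. The sketch uses SQLite
from the Python standard library purely as a stand-in for a Virtuoso
table, and it does not show the RDF-Views mapping itself. The table
name triple_meta, the meta:extractor property and the sample rows are
hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in; not Virtuoso.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE triple_meta (
        g  TEXT NOT NULL,  -- graph
        s  TEXT NOT NULL,  -- subject
        p  TEXT NOT NULL,  -- predicate
        o  TEXT NOT NULL,  -- object
        pm TEXT NOT NULL,  -- meta property, e.g. extractor
        om TEXT NOT NULL   -- meta object, e.g. extractor ID
    )
    """
)

# One row per (triple, metadata) pair instead of five reification triples.
rows = [
    ("http://dbpedia.org", "dbpedia:Some_Person", "foaf:birthPlace",
     "dbpedia:Leipzig", "meta:extractor", "InfoboxExtractor"),
    ("http://dbpedia.org", "dbpedia:Some_Person", "foaf:birthPlace",
     "dbpedia:Leipzig", "meta:extractor", "PersonDataExtractor"),
]
conn.executemany("INSERT INTO triple_meta VALUES (?, ?, ?, ?, ?, ?)", rows)


def extractors_for(conn, s, p, o):
    """Return the extractor IDs recorded for one triple, sorted by name."""
    cur = conn.execute(
        "SELECT om FROM triple_meta "
        "WHERE s = ? AND p = ? AND o = ? AND pm = 'meta:extractor' "
        "ORDER BY om",
        (s, p, o),
    )
    return [row[0] for row in cur]


result = extractors_for(
    conn, "dbpedia:Some_Person", "foaf:birthPlace", "dbpedia:Leipzig"
)
```

Each metadata value costs one row here, and the provenance lookup is a
single indexed-column query rather than a join over reification triples.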
5. Please tell me if you have more ideas. What's possible with Virtuoso?
What would be the best way?
If you are set with the extractors, we can quite easily derive a DBpedia
Cartridge from our Wikipedia Cartridge. The only differences will be:
1. Graph Grp. IRI will be used for DBpedia
2. Source File IRI for Named Graph IRIs
3. SPARUL (post-processing of the OAI-PMH feeds) will apply to the
respective Named Graph IRIs
4. Make a schedule in Virtuoso for item 3.
Scalability is not an issue, especially with the Cluster Edition, which
can spread I/O over many Cluster Nodes. We can even use Node Roles to
control what's in the foreground and what's in the background, etc.
All I need is confirmation that the extraction for realtime DBpedia is done.
Links:
1. http://docs.openlinksw.com/virtuoso/rdfgraphsecurity.html -- Graph
Security
2. http://dbpedia2.openlinksw.com:8899/void/Dataset -- shows you the
quints in action where each DBpedia source file is a Graph IRI within
the Graph Group IRI <http://dbpedia.org> .
Kingsley
Many thanks,
Sebastian, AKSW
http://bis.informatik.uni-leipzig.de/SebastianHellmann
[1] http://en.wikipedia.org/wiki/Wikipedia:Persondata
_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users
--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com