I agree that it’s much more practical to validate and clean RDF data
before you run queries rather than after.
If your data is clean, you can easily write SPARQL queries that do everything
people do with SQL queries and then some. If you do your data cleaning in a
distinct post-processing phase, you can do testing of various sorts to get a
handle on the state of the data. You can then
The problem of cleaning up bad data by modifying the query (probably in a large
number of variations) and post-processing the results is a difficult one, and
in general I’d say beyond the state of the art at the moment.
I’ll mention that my open source framework
https://github.com/paulhoule/infovore/wiki
addresses these issues by prefiltering the data in a batch job with a process
that handles large data sets easily because it minimizes memory consumption.
Rather than supporting SPIN validation over the complete graph, it could
support SPIN validation, or the equivalent, over particular subgraphs that
can be supplied scalably.
As for right now, simple rules applied to individual triples, combined with
other entropy-reducing transformations, can turn data sources like Freebase
and DBpedia into 100% valid RDF and maintain high data fidelity despite
substantial compression (approaching 50% in the case of Freebase).
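
To give a rough idea of what "simple rules applied to individual triples" can look like, here is a minimal Python sketch. It is not Infovore's actual code: the regular expressions are a deliberately simplified stand-in for the N-Triples grammar, and the names is_valid_triple and clean are made up for illustration. The point is only that each line is judged on its own, so the data can be streamed without holding the graph in memory.

```python
import re
from typing import Iterable, Iterator

# Simplified N-Triples term patterns (an assumption, not the full W3C grammar).
IRI = r"<[^<>\"{}|^`\\\x00-\x20]*>"
BNODE = r"_:[A-Za-z0-9]+"
LITERAL = (
    r"\"(?:[^\"\\\n\r]|\\.)*\""                       # quoted string with escapes
    r"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^" + IRI + r")?"  # optional lang tag or datatype
)
TRIPLE = re.compile(
    rf"^\s*({IRI}|{BNODE})\s+({IRI})\s+({IRI}|{BNODE}|{LITERAL})\s*\.\s*$"
)


def is_valid_triple(line: str) -> bool:
    """Return True if the line looks like one well-formed triple."""
    return TRIPLE.match(line) is not None


def clean(lines: Iterable[str]) -> Iterator[str]:
    """Stream lines through the per-triple rule, dropping anything invalid.

    Blank lines and comments are skipped; nothing is buffered.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if is_valid_triple(line):
            yield line
```

Because clean is a generator, it can be wrapped around an open file handle and written straight back out, one line at a time.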
Infovore’s subject-partitioned output goes particularly well with Virtuoso.
Infovore breaks up each RDF knowledge base into a large number of shards. If
you start multiple bulk loaders, Virtuoso will load from the shards in
parallel, which can speed up loading 3-4 times or so on a common quad core
computer... For more than a year I’ve been using it to load stuff into
Virtuoso that would be hard to fit in otherwise...
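
As an illustration of subject partitioning, here is a small Python sketch. It is only my assumption of how such a scheme could work, not a description of what Infovore does internally: the function names and the choice of CRC32 are hypothetical. The idea is that every triple with the same subject lands in the same shard, so each shard can be handed to a separate bulk loader.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(subject: str, n_shards: int) -> int:
    """Map a subject term to a shard number.

    CRC32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings.
    """
    return zlib.crc32(subject.encode("utf-8")) % n_shards


def partition(lines: Iterable[str], n_shards: int) -> Dict[int, List[str]]:
    """Group N-Triples lines by the shard of their subject (the first term)."""
    shards: Dict[int, List[str]] = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)  # subject is everything before the first whitespace
        if not parts:
            continue  # skip blank lines
        shards[shard_for(parts[0], n_shards)].append(line)
    return dict(shards)
```

This version collects the shards in memory to keep the example short; a real batch job would append each line to a per-shard file instead.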
From: Matías Parodi
Sent: Wednesday, April 24, 2013 11:24 AM
To: Alexey Zakhlestin
Cc: [email protected]
Subject: Re: [Virtuoso-users] Data Integrity in RDF?
There is this project called SPIN (spinrdf.org) which lets you add constraints
that are checked when you insert new data into the store. In this way, as I see
it, you're not "blocking the data-flow". Whatever gets into the store is
"valid", and the inferences are done in the same way (there's no difference
after you inserted the instances).
Enforcing constraints in the application layer requires, in this particular
case at least, a lot of code and overhead (latency!).
If Virtuoso doesn't support data integrity for RDF, I think I'm going to use
the stack TDB+Jena+SPIN+Virtuoso (I haven't tried to integrate them all yet).
On Wed, Apr 24, 2013 at 12:12 PM, Alexey Zakhlestin <[email protected]> wrote:
On 24.04.2013, at 19:04, Matías Parodi <[email protected]> wrote:
> Any idea about forcing constraints in Virtuoso?
well… my personal belief is that constraints should be enforced at the
application level.
RDF is good because it allows you to store and exchange
opaque-but-introspectable data.
The application, on the other hand, can use RDFS/OWL/whatever rules to fit it
into some system, but this shouldn't block data flow at the storage level.
So, this way you can let your application comfortably work with a subset
of RDF, ignoring other pieces which fly by.
_______________________________________________
Virtuoso-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/virtuoso-users