I agree that it’s much more practical to validate and clean RDF data 
before you run queries rather than after.

If your data is clean, you can easily write SPARQL queries that do everything 
people do with SQL queries and then some. If you do your data cleaning in a 
distinct post-processing phase, you can run tests of various sorts to get a 
handle on the state of the data before any of it reaches the store.
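One such test, sketched minimally here (the naive N-Triples regex is my own 
assumption and would need hardening for real data), is a predicate census that 
also flags unparseable lines:

```python
import re
from collections import Counter

# Naive N-Triples pattern: subject, predicate, object, terminating " ."
# (a real parser must also handle literals containing "> ." and the like)
TRIPLE = re.compile(r'^(\S+)\s+(<[^>]+>)\s+(.+?)\s*\.\s*$')

def predicate_census(lines):
    """Count occurrences of each predicate and flag unparseable lines."""
    counts, bad = Counter(), 0
    for line in lines:
        m = TRIPLE.match(line)
        if m:
            counts[m.group(2)] += 1
        elif line.strip():
            bad += 1
    return counts, bad
```

A skewed predicate distribution or a high count of bad lines tells you a lot 
about the state of a dump before you commit to loading it.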

The problem of cleaning up bad data by modifying the query (probably in a large 
number of variations) and post-processing the results is a difficult one, and 
in general I’d say it is beyond the state of the art at the moment.

I’ll mention that my open source framework

https://github.com/paulhoule/infovore/wiki

addresses these issues by prefiltering the data in a batch job with a process 
that handles large data sets easily because it minimizes memory consumption.  
Rather than supporting SPIN validation over the complete graph, it could 
support SPIN validation, or the equivalent, over particular subgraphs, which 
can be supplied scalably.

As for right now, simple rules applied to individual triples, combined with 
other entropy-reducing transformations, can turn data sources like Freebase and 
DBpedia into 100% valid RDF while maintaining high data fidelity despite 
substantial compression (approaching 50% in the case of Freebase).
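A sketch of what such per-triple rules might look like (the specific patterns 
and rules here are illustrative assumptions on my part, not Infovore’s actual 
rule set):

```python
import re

# Illustrative validity patterns (assumptions, not Infovore's actual rules)
IRI = re.compile(r'^<[A-Za-z][A-Za-z0-9+.-]*:[^<>"{}|^`\\\s]*>$')
BNODE = re.compile(r'^_:[A-Za-z][A-Za-z0-9]*$')

def keep_triple(s, p, o):
    """Per-triple filter: each rule inspects one triple in isolation,
    so the pass streams over the data with minimal memory."""
    if not (IRI.match(s) or BNODE.match(s)):
        return False                  # subject must be an IRI or blank node
    if not IRI.match(p):
        return False                  # predicate must be an IRI
    if any(ord(c) < 0x20 for c in o):
        return False                  # raw control chars break N-Triples
    return True
```

Because each decision depends only on the triple at hand, the filter runs in a 
single streaming pass regardless of how big the dump is.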

Infovore’s subject-partitioned output goes particularly well with Virtuoso. 
Infovore breaks each RDF knowledge base into a large number of shards. If 
you start multiple bulk loaders, Virtuoso will load from the shards in 
parallel, which can speed up loading by roughly 3-4x on a common quad-core 
computer. For more than a year I’ve been using it to load data into 
Virtuoso that would be hard to fit in otherwise.
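The subject-partitioning idea can be sketched like this (the hash function and 
shard count are my own illustrative choices, not Infovore’s exact scheme):

```python
import hashlib

def shard_of(subject, n_shards=8):
    """Hash the subject to pick a shard, so every triple about one
    subject lands in the same shard file."""
    digest = hashlib.md5(subject.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_shards

def partition(triples, n_shards=8):
    """Split (s, p, o) triples into n_shards buckets; each bucket can
    then be handed to its own bulk loader and loaded in parallel."""
    shards = [[] for _ in range(n_shards)]
    for s, p, o in triples:
        shards[shard_of(s, n_shards)].append((s, p, o))
    return shards
```

Keeping all triples about one subject in the same shard is what makes the 
shards independent enough for parallel loaders to work on them without 
stepping on each other.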


From: Matías Parodi 
Sent: Wednesday, April 24, 2013 11:24 AM
To: Alexey Zakhlestin 
Cc: [email protected] 
Subject: Re: [Virtuoso-users] Data Integrity in RDF?

There is a project called SPIN (spinrdf.org) which lets you add constraints 
that are checked when you insert new data into the store. This way, as I see 
it, you're not "blocking the data flow". Whatever gets into the store is 
"valid", and inferences are done in the same way (there's no difference 
after you've inserted the instances).
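In miniature, that insert-time checking looks like a gatekeeper that evaluates 
constraints before anything reaches the store (the store, the constraint names, 
and the age rule below are hypothetical stand-ins; real SPIN constraints are 
SPARQL queries attached to classes):

```python
def insert(store, triple, constraints):
    """Append the triple only if every constraint holds, so everything
    in the store is valid by construction."""
    violations = [name for name, ok in constraints if not ok(triple)]
    if violations:
        raise ValueError("constraint(s) violated: %s" % violations)
    store.append(triple)

# Hypothetical constraint: ages must be non-negative integers.
def age_is_valid(triple):
    s, p, o = triple
    if p != '<http://ex/age>':
        return True                    # rule only applies to ex:age
    return o.isdigit()                 # rejects negatives and non-numbers
```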

Enforcing constraints in the application layer requires, in this particular 
case at least, a lot of code and overhead (latency!).

In case Virtuoso doesn't support data integrity for RDF, I think I'm going to 
use the stack TDB+Jena+SPIN+Virtuoso (I still haven't tried to integrate them all).




On Wed, Apr 24, 2013 at 12:12 PM, Alexey Zakhlestin <[email protected]> wrote:


  On 24.04.2013, at 19:04, Matías Parodi <[email protected]> wrote:

  > Any idea about forcing constraints in Virtuoso?


  well… my personal belief is that constraints should be enforced at the 
application level.
  RDF is good because it allows you to store and exchange 
opaque-but-introspectable data.

  The application, on the other hand, can use RDFS/OWL/whatever rules to fit 
it into some system, but this shouldn't block the data flow at the storage level.
  So, this way you can let your application comfortably work with a subset 
of RDF, ignoring the other pieces which fly by.




--------------------------------------------------------------------------------
_______________________________________________
Virtuoso-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/virtuoso-users
