Hi Mizanur,

when you have big RDF datasets it can make sense to use MapReduce (but only if you already have a Hadoop cluster at hand. Is this your case?). You say that your data is 'huge'; just for the sake of curiosity... how many triples/quads is 'huge'? ;-) Most of the use cases I've seen for computing statistics on RDF datasets were trivial MapReduce jobs.
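To give you an idea of how trivial: once your data is serialized as N-Triples (one triple per line), counting triples is just counting non-blank, non-comment lines. A minimal plain-Java sketch of that per-line test (illustrative only, not the Hadoop API; class and example URIs are made up):

```java
import java.util.stream.Stream;

// Counting triples in an N-Triples dump reduces to counting lines,
// because N-Triples puts exactly one triple per line.
public class TripleCount {

    // A line holds a triple unless it is blank or a comment.
    static boolean isTriple(String line) {
        String t = line.trim();
        return !t.isEmpty() && !t.startsWith("#");
    }

    public static void main(String[] args) {
        long n = Stream.of(
                "<http://example/s> <http://example/p> <http://example/o> .",
                "# a comment",
                "<http://example/s> <http://example/p> \"42\" .")
            .filter(TripleCount::isTriple)
            .count();
        System.out.println(n + " triples");  // prints "2 triples"
    }
}
```

In a real MapReduce job this predicate would sit in the mapper, with the framework handling the splitting and the final sum.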
For a couple of examples of using MapReduce with RDF datasets, have a look here:

  https://github.com/castagna/jena-grande
  https://github.com/castagna/tdbloader4

This, for example, is certainly not exactly what you need, but I am sure that with small changes you can get what you want:

  https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/StatsDriver.java

Last but not least, you'll need to dump your RDF data out onto HDFS. I suggest you use the N-Triples/N-Quads serialization formats: they are line-based, so they split and process in parallel easily.

Running SPARQL queries on top of a Hadoop cluster is another (long and not easy) story. It might be possible to translate part of the SPARQL algebra into Pig Latin scripts and use Pig. In my opinion, however, it makes more sense to use MapReduce to filter/slice massive datasets, load the result into a triple store, and refine your data analysis there with SPARQL.

My 2 cents,
Paolo

Md. Mizanur Rahoman wrote:
> Dear All,
>
> I want to collect some statistics over RDF data. My triple store is
> Virtuoso and I am using Jena for executing my queries. I want to get
> statistics like:
> i) how many resources are in my dataset, ii) which positions of the
> dataset a resource appears in (i.e., sub/prd/obj), etc. As my data is
> huge, I want to use Hadoop MapReduce to calculate such statistics.
>
> Can you please suggest?
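On the concrete question of which positions (sub/prd/obj) a resource appears in: a mapper over an N-Triples dump can emit a (node, position) pair per term, and the reduce phase then sums them, word-count style. Here is a plain-Java sketch of that per-line map logic (not the Hadoop API; the class name and the naive whitespace split are mine, for illustration only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Per-line logic a MapReduce mapper would run over an N-Triples dump:
// split the line into subject, predicate, object and emit
// ("node<TAB>position", 1) pairs; a reducer sums them per key.
public class PositionCount {

    static void map(String line, Map<String, Integer> counts) {
        String t = line.trim();
        if (t.isEmpty() || t.startsWith("#")) return;
        // Naive split: fine for N-Triples lines whose object is not a
        // literal containing whitespace; use a real RDF parser otherwise.
        String[] parts = t.split("\\s+", 3);
        if (parts.length < 3) return;
        String object = parts[2].replaceAll("\\s*\\.\\s*$", "");
        emit(counts, parts[0], "subject");
        emit(counts, parts[1], "predicate");
        emit(counts, object, "object");
    }

    static void emit(Map<String, Integer> counts, String node, String pos) {
        counts.merge(node + "\t" + pos, 1, Integer::sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        map("<http://example/s> <http://example/p> <http://example/o> .", counts);
        map("<http://example/o> <http://example/p> \"42\" .", counts);
        counts.forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Note how <http://example/o> shows up once as object and once as subject, which is exactly the kind of statistic you asked about. In a real job you would parse the lines with Jena's RIOT parser rather than splitting on whitespace.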