Hi Mizanur,

when you have big RDF datasets it can make sense to use MapReduce (but only if you already have a Hadoop cluster at hand. Is this your case?). You say that your data is 'huge'; just for the sake of curiosity... how many triples/quads is 'huge'? ;-) Most of the use cases I've seen for computing statistics on RDF datasets were trivial MapReduce jobs.
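To give you an idea of how trivial: once your data is serialized as N-Triples (one triple per line), counting triples is just counting non-blank, non-comment lines. A minimal plain-Java sketch of that per-line test (illustrative only, not the Hadoop API; class and example URIs are made up):

```java
import java.util.stream.Stream;

// Counting triples in an N-Triples dump reduces to counting lines,
// because N-Triples puts exactly one triple per line.
public class TripleCount {

    // A line holds a triple unless it is blank or a comment.
    static boolean isTriple(String line) {
        String t = line.trim();
        return !t.isEmpty() && !t.startsWith("#");
    }

    public static void main(String[] args) {
        long n = Stream.of(
                "<http://example/s> <http://example/p> <http://example/o> .",
                "# a comment",
                "<http://example/s> <http://example/p> \"42\" .")
            .filter(TripleCount::isTriple)
            .count();
        System.out.println(n + " triples");  // prints "2 triples"
    }
}
```

In a real MapReduce job this predicate would sit in the mapper, with the framework handling the splitting and the final sum.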
For a couple of examples of using MapReduce with RDF datasets, have a look here:

  https://github.com/castagna/jena-grande
  https://github.com/castagna/tdbloader4

This, for example, is certainly not exactly what you need, but I am sure that with small changes you can get what you want:

  https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/StatsDriver.java

Last but not least, you'll need to dump your RDF data out onto HDFS. I suggest you use the N-Triples/N-Quads serialization formats: they are line-based, so they split and process in parallel easily.

Running SPARQL queries on top of a Hadoop cluster is another (long and not easy) story. It might be possible to translate part of the SPARQL algebra into Pig Latin scripts and use Pig. In my opinion, however, it makes more sense to use MapReduce to filter/slice massive datasets, load the result into a triple store, and refine your data analysis there with SPARQL.

My 2 cents,
Paolo

Md. Mizanur Rahoman wrote:
> Dear All,
>
> I want to collect some statistics over RDF data. My triple store is
> Virtuoso and I am using Jena for executing my queries. I want to get
> statistics like:
> i) how many resources are in my dataset, ii) which positions of the
> dataset a resource appears in (i.e., sub/prd/obj), etc. As my data is
> huge, I want to use Hadoop MapReduce to calculate such statistics.
>
> Can you please suggest?
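On the concrete question of which positions (sub/prd/obj) a resource appears in: a mapper over an N-Triples dump can emit a (node, position) pair per term, and the reduce phase then sums them, word-count style. Here is a plain-Java sketch of that per-line map logic (not the Hadoop API; the class name and the naive whitespace split are mine, for illustration only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Per-line logic a MapReduce mapper would run over an N-Triples dump:
// split the line into subject, predicate, object and emit
// ("node<TAB>position", 1) pairs; a reducer sums them per key.
public class PositionCount {

    static void map(String line, Map<String, Integer> counts) {
        String t = line.trim();
        if (t.isEmpty() || t.startsWith("#")) return;
        // Naive split: fine for N-Triples lines whose object is not a
        // literal containing whitespace; use a real RDF parser otherwise.
        String[] parts = t.split("\\s+", 3);
        if (parts.length < 3) return;
        String object = parts[2].replaceAll("\\s*\\.\\s*$", "");
        emit(counts, parts[0], "subject");
        emit(counts, parts[1], "predicate");
        emit(counts, object, "object");
    }

    static void emit(Map<String, Integer> counts, String node, String pos) {
        counts.merge(node + "\t" + pos, 1, Integer::sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        map("<http://example/s> <http://example/p> <http://example/o> .", counts);
        map("<http://example/o> <http://example/p> \"42\" .", counts);
        counts.forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Note how <http://example/o> shows up once as object and once as subject, which is exactly the kind of statistic you asked about. In a real job you would parse the lines with Jena's RIOT parser rather than splitting on whitespace.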