Hi Paolo,

Thanks for your reply.
Right now I am only using DBpedia, GeoNames and NYTimes from the LOD cloud, and later on I want to extend my dataset. By the way, yes, I could use SPARQL directly to collect my required statistics, but my assumption is that using Hadoop could give me a boost in collecting those statistics. I will get back to you after going through your links.

Sincerely,
Md Mizanur

On Tue, Jun 26, 2012 at 12:50 AM, Paolo Castagna <castagna.li...@googlemail.com> wrote:
> Hi Mizanur,
> when you have big RDF datasets, it might make sense to use MapReduce (but
> only if you already have a Hadoop cluster at hand. Is this your case?).
> You say that your data is 'huge'; just for the sake of curiosity... how
> many triples/quads is 'huge'? ;-)
> Most of the use cases I've seen related to statistics on RDF datasets were
> trivial MapReduce jobs.
>
> For a couple of examples of using MapReduce with RDF datasets, have a look
> here:
> https://github.com/castagna/jena-grande
> https://github.com/castagna/tdbloader4
>
> This, for example, is certainly not exactly what you need, but I am sure
> that with little changes you can get what you want:
> https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/StatsDriver.java
>
> Last but not least, you'll need to dump your RDF data out onto HDFS.
> I suggest you use the N-Triples/N-Quads serialization formats.
>
> Running SPARQL queries on top of a Hadoop cluster is another (long and
> not easy) story.
> But it might be possible to translate part of the SPARQL algebra into Pig
> Latin scripts and use Pig.
> In my opinion, however, it makes more sense to use MapReduce to
> filter/slice massive datasets, load the result into a triple store and
> refine your data analysis using SPARQL there.
>
> My 2 cents,
> Paolo
>
> Md. Mizanur Rahoman wrote:
> > Dear All,
> >
> > I want to collect some statistics over RDF data. My triple store is
> > Virtuoso and I am using Jena for executing my query.
> > I want to get some statistics like:
> > i) how many resources are in my dataset; ii) in which position of the
> > dataset each resource appears (i.e., sub/prd/obj); etc. As my data is
> > huge, I want to use Hadoop MapReduce to calculate such statistics.
> >
> > Can you please suggest?

--
*Md Mizanur Rahoman*
PhD Student
The Graduate University for Advanced Studies
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan.
Cell # +81-80-4076-9044
email: mi...@nii.ac.jp
Web: http://www.nii.ac.jp/en/
&
Lecturer, Department of Computer Science & Engineering
Begum Rokeya University, Rangpur, Bangladesh.
email: mdmizanur.raho...@gmail.com, mi...@brur.ac.bd
Cell # +88 01823 806618
Web: http://www.brur.ac.bd
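[Editor's note] The "trivial MapReduce job" Paolo describes — counting, for each resource in an N-Triples dump on HDFS, how often it appears in the subject/predicate/object position — could be sketched, for instance, as a Hadoop Streaming mapper. The code below is only an illustration, not code from the thread: the whitespace-based line parsing is an assumption that breaks on literals containing spaces (a real job should use a proper N-Triples parser, e.g. Jena's RIOT on the Java side), and `count_positions` is a local stand-in for the reduce phase.

```python
from collections import Counter

def map_line(line):
    """Mapper logic: emit (term, position) pairs for one N-Triples line.

    Naive whitespace split -- assumes simple N-Triples with no spaces
    inside literals. Under Hadoop Streaming each pair would be printed
    as 'term<TAB>position<TAB>1' instead of yielded.
    """
    line = line.strip()
    if not line or line.startswith('#'):
        return                           # skip blanks and comments
    parts = line.split(None, 2)          # subject, predicate, remainder
    if len(parts) < 3:
        return                           # malformed line
    subj, pred, rest = parts
    obj = rest.rstrip().rstrip('.').strip()  # drop the trailing ' .'
    yield subj, 'sub'
    yield pred, 'prd'
    yield obj, 'obj'

def count_positions(lines):
    """Local stand-in for the reduce phase: sum counts per (term, position).

    In the real job, a streaming reducer would sum the 1s emitted by the
    mapper for each 'term<TAB>position' key.
    """
    counts = Counter()
    for line in lines:
        for term, pos in map_line(line):
            counts[term, pos] += 1
    return counts
```

Statistic (i), the number of distinct resources, then falls out as the number of distinct terms in the result; statistic (ii) is the per-position breakdown itself.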