Hi Paolo,

Thanks for your reply.

Right now I am only using DBPedia, Geoname and NYTimes for LOD cloud. And
later on I want to extend my dataset.

By the way, yes, I can use sparql directly to collect my required
statistics but my assumption is using Hadoop could give me some boosting in
collecting those stat.

I will knock you after going through your links.

-
Sincerely
Md Mizanur



On Tue, Jun 26, 2012 at 12:50 AM, Paolo Castagna <
castagna.li...@googlemail.com> wrote:

> Hi Mizanur,
> when you have big RDF datasets, it might make sense to use MapReduce (but
> only if you already have an Hadoop cluster at hand. Is this your case?).
> You say that your data is 'huge', just for the sake of curiosity... how
> many triples/quads is 'huge'? ;-)
> Most of the use cases I've seen related to statistics on RDF datasets were
> trivial MapReduce jobs.
>
> For a couple of examples on using MapReduce with RDF datasets have a look
> here:
> https://github.com/castagna/jena-grande
> https://github.com/castagna/tdbloader4
>
> This, for example, is certainly not exactly what you need, but I am sure
> that with little changes you can get what you want:
>
> https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/StatsDriver.java
>
> Last but not least, you'll need to dump your RDF data out onto HDFS.
> I suggest you use N-Triples/N-Quads serialization formats.
>
> Running SPARQL queries on top of an Hadoop cluster is another (long and
> not easy) story.
> But, it might be possible to translate part of the SPARQL algebra into Pig
> Latin scripts and use Pig.
> In my opinion however, it makes more sense to use MapReduce to
> filter/slice massive datasets, load the result into a triple store and
> refine your data analysis using SPARQL there.
>
> My 2 cents,
> Paolo
>
> Md. Mizanur Rahoman wrote:
> > Dear All,
> >
> > I want to collect some statistics over RDF data. My triple store is
> > Virtuoso and I am using Jena for executing my query.  I want to get some
> > statistics like
> > i) how many resources in my dataset ii) resources belong to in which
> > position of dataset (i.e., sub/prd/obj) etc. As my data is huge, I want
> to
> > use Hadoop Map Reduce in calculating such statistics.
> >
> > Can you please suggest.
> >
>



-- 

*Md Mizanur Rahoman*
PhD Student
The Graduate University for Advanced Studies
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku,
Tokyo 101-8430, Japan.
Cell # +81-80-4076-9044
email: mi...@nii.ac.jp
Web: http://www.nii.ac.jp/en/

&

Lecturer, Department of Computer Science & Engineering
Begum Rokeya University, Rangpur, Bangladesh.
email: mdmizanur.raho...@gmail.com, mi...@brur.ac.bd
Cell # +88 01823 806618
Web: http://www.brur.ac.bd

Reply via email to