I can see how that sentence needs clarification :-0 It should have read "how hadoop might fit into the RDF picture", or even "how hadoop might fit into the semantic repository picture".
wrt SDB, the advantage over TDB (unless work in the past 9-12 months has changed this - correct me if so) is concurrent access from different JVMs - e.g. two app servers on two nodes talking to the same repository (and not via some intermediate SPARQL endpoint). And in the enterprise, making use of any skills or resources with traditional RDBMS or "database" experience is always attractive: replication, tuning, backup etc.
So, my thoughts appear to be in line with Andy's atm - generating some kind of load format, for either programmatic or command-line bulk load, that most database systems support seems like a good use of hadoop (probably not dissimilar to the map pipeline you've done here), albeit as an ETL mechanism. Indeed, understanding how best to read an RDF "file" consistently, so that it can be made use of in a Map job, seems like the first step, and then mapping that to some database schema - normalized or not. I'm not sure if you'd agree, and I'm not sure myself that it is the best approach. There is the hadoop overhead, of course, and that needs to be measured against bulk load performance on a single but multicore node.
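To make the ETL idea above a bit more concrete, here is a minimal sketch of the per-record transform such a Map job might perform: turning one N-Triples line into a tab-separated row for a database bulk loader. The whitespace-based parsing and the three-column TSV "schema" are my own illustrative assumptions (a real job would use a proper RDF parser), not anything from tdbloader3.

```java
// Sketch of the per-record transform a bulk-load ETL Map job might do:
// turn one N-Triples line into a tab-separated row suitable for a
// database bulk loader (e.g. COPY / LOAD DATA INFILE). The parsing is
// deliberately naive and illustrative only.
public class TripleToTsv {

    /** Convert one N-Triples line to "subject\tpredicate\tobject", or null to skip. */
    static String toTsv(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.startsWith("#")) {
            return null; // skip blank lines and comments
        }
        if (s.endsWith(".")) {
            s = s.substring(0, s.length() - 1).trim(); // drop the final '.'
        }
        // Subject and predicate contain no unescaped spaces in N-Triples,
        // so split twice and keep the rest of the line as the object term
        // (which may contain spaces, e.g. in literals).
        String[] parts = s.split("\\s+", 3);
        if (parts.length < 3) {
            return null; // malformed line
        }
        return parts[0] + "\t" + parts[1] + "\t" + parts[2];
    }

    public static void main(String[] args) {
        String line = "<http://example.org/s> <http://example.org/p> \"a literal\" .";
        System.out.println(toTsv(line));
    }
}
```

In a real Hadoop job this would be the body of a Mapper's map() method, with the TSV rows then bulk-loaded into whatever schema the target database uses.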
So, I am just shooting the breeze here really. I'm doing some work with hadoop at the moment, and because of its relatively low learning curve (for basic usage at least) and some of the promise it has in terms of performance, if there is technological room for parallelism when it comes to loading triples and/or query performance then I am interested in exploring it as much as I can.
On 03/11/11 13:19, Paolo Castagna wrote:
nat lu wrote:
> Impressive work, was wondering how hadoop might fit into the hadoop picture.

Hi, can you clarify what you mean with "how hadoop might fit into the hadoop picture"?

tdbloader3 is just one very specific (and probably not trivial) MapReduce job/use case. So, it's a practical use case for Hadoop. Building dictionaries or B+Trees is one of the good examples where MapReduce does not fit perfectly (however... there are potentially significant advantages in making your algorithm fit the MapReduce paradigm). PageRank (I know from experience) is another good example where MapReduce isn't ideal. :-) As far as I am aware, there are no public examples of people using MapReduce to build B+Tree indexes. Although, it's not rocket science, since mainly you need to sort stuff, and Hadoop/MapReduce is great at sorting stuff.

> Any thoughts on doing this with SDB ?

One reason why something such as tdbloader3 is possible is that, having the source code, it's possible to see the binary format of the TDB indexes. While in theory it would be possible to do something similar with some of the open source DBMS systems supported by SDB, I do not see how a similar approach could be employed with closed source DBMS systems. Why are you asking specifically about SDB? What does SDB give you that you do not have in TDB? Transactions is not an answer anymore (we still need more testing, but that barrier is gone).

By the way, thanks for the "impressive", I appreciate it. However, as I said, to have good (or impressive) overall numbers I think we need much larger clusters, and those are not easily available to people. So the efficiency and utility of something like tdbloader3 is somewhat limited by that.

Last Christmas I read a couple of books on CUDA and I still have 200 or more cores on my GPU sitting idle most of the time. Maybe this Christmas it is time to think whether I can put that into practice and use GPUs to generate TDB indexes. This might be more readily available and useful to people...
Many love computer games! ;-) Any CUDA (and jCUDA) experts here on jena-users?

From my experience with tdbloader3 and parallel processing, I'd say that the fact that current node ids (currently 64 bits) are offsets into the nodes.dat file is a big "impediment" to distributed/parallel processing. Mainly because, whatever you do, you first need to build a dictionary for that, and this is not trivial in parallel. However, if we could, given an RDF node value, generate a node id with a hash function (sufficiently big so that the probability of collision is less than being hit by an asteroid) (128 bits?), then tdbloader3 could be massively simplified, merging TDB indexes directly would become trivial (as for Lucene indexes), ... my life at work would be so much simpler!

The drawback of 128 bit node ids is that suddenly you might need to double your RAM to achieve the same performance (to be proven and verified with experiments). However, there are many other good side effects from what you can fit into 128 bits. For example, I am not so sure anymore whether an optimization such as the one proposed in JENA-144 is possible without ensuring that all node values can be encoded in the bits available in the node id: https://issues.apache.org/jira/browse/JENA-144.

Paolo

On 03/11/11 11:38, Paolo Castagna wrote:

Hi, I want to share the results of an experiment I did using tdbloader3 with a Hadoop cluster (provisioned via Apache Whirr) running on EC2. tdbloader3 generates TDB indexes with a sequence of four MapReduce jobs.

tdbloader3 is ASL but it is not an official module of Apache Jena. At the moment it's a prototype/experiment and this is the reason why its sources are on GitHub: https://github.com/castagna/tdbloader3

*If* it turns out to be useful to people and *if* there is demand, I have no problem contributing this to Apache Jena as a separate module (and supporting it).
But I suspect there aren't many people with Hadoop clusters of ~100 nodes, RDF datasets of billions of triples or quads, and the need to generate TDB indexes. :-) For now GitHub works well. More testing and experiments are necessary.

The dataset I used in my experiment is an Open Library version in RDF: 962,071,613 triples. I found it somewhere in the office in our S3 account. I am not sure if it is publicly available. Sorry.

To run the experiment I used a 17-node Hadoop cluster on EC2. This is the Apache Whirr recipe I used to provision the cluster:

---------[ hadoop-ec2.properties ]----------
whirr.cluster-name=hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,16 hadoop-datanode+hadoop-tasktracker
whirr.instance-templates-max-percent-failures=100 hadoop-namenode+hadoop-jobtracker,80 hadoop-datanode+hadoop-tasktracker
whirr.max-startup-retries=1
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID_LIVE}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY_LIVE}
whirr.hardware-id=c1.xlarge
whirr.location-id=us-east-1
whirr.image-id=us-east-1/ami-1136fb78
whirr.private-key-file=${sys:user.home}/.ssh/whirr
whirr.public-key-file=${whirr.private-key-file}.pub
whirr.hadoop.version=0.20.204.0
whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz
---------

The recipe specifies a 17-node cluster; it also specifies that a 20% failure rate during the startup of the cluster is acceptable. Only 15 datanodes/tasktrackers were successfully started and joined the Hadoop cluster. Moreover, during the processing, 2 datanodes/tasktrackers died. Hadoop coped with that with no problems.

As you can see, I used c1.xlarge instances, which have 7 GB of RAM, 20 "EC2 compute units" (8 virtual cores with 2.5 EC2 compute units each) and 1.6 TB of disk space. The price of c1.xlarge instances in US East (Virginia) is $0.68 per hour.
These are the elapsed times for the jobs I ran on the cluster:

distcp: 1hrs, 19mins (1)
tdbloader3-stats: 1hrs, 22mins (2)
tdbloader3-first: 1hrs, 48mins (3)
tdbloader3-second: 2hrs, 40mins
tdbloader3-third: 49mins
tdbloader3-fourth: 1hrs, 36mins
download: - (4)

(1) In my case, distcp copied 112 GB (uncompressed) from S3 onto HDFS.
(2) tdbloader3-stats is optional and is not strictly required for the TDB indexes, but it is useful to produce the stats.opt file. It can probably be merged into one of the tdbloader3-first|second jobs.
(3) Considering the tdbloader3-{first-fourth} jobs only, they took ~7 hours, therefore an overall speed of about 38,000 triples per second. Not impressive, and a bit too expensive to be run on a cluster running in the cloud (it's a good business for Amazon though).
(4) It depends on the bandwidth available. This step can be improved by compressing the nodes.dat files, and part of what download does can be done as a 10-map (map-only) job.

For those lucky enough to already have a 40-100 node Hadoop cluster, tdbloader3 can be an option alongside tdbloader and tdbloader2. Certainly not a replacement for these two great tools + as much RAM as you can get, for the rest of us. :-)

Last but not least, if you are reading this and you are one of those lucky enough to have a (sometimes idle) Hadoop cluster of 40-100 nodes and you would like to help me out with the testing of tdbloader3 or run a few experiments with it, please get in touch with me.

Cheers,
Paolo

PS: Apache Whirr (http://whirr.apache.org/) is a great project to quickly have a Hadoop cluster up and running for testing or running experiments. Moreover, Whirr is not limited to Hadoop: it has support for ZooKeeper, HBase, Cassandra, ElasticSearch, etc. EC2 is great for short-lived bursts and the easy/elastic provisioning. However, it is probably not the best choice for long-running clusters. If you use m1.small instances to run your Hadoop cluster you are likely to have problems with RAM.
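As an aside on the 128-bit hash node id idea discussed earlier in the thread, here is a minimal sketch of what "hash instead of dictionary" could look like. MD5 is used purely because it conveniently yields 128 bits; the class and method names are hypothetical illustrations, not TDB APIs.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the "hash instead of dictionary" idea: derive a fixed-width
// 128-bit id directly from the RDF node value, so every mapper can compute
// ids independently, with no shared nodes.dat offsets to coordinate.
// MD5 is used here only because it yields 128 bits; the names below are
// hypothetical, not TDB APIs.
public class HashNodeId {

    /** 128-bit node id for an RDF node value, as a 32-char hex string. */
    static String nodeId(String nodeValue) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(nodeValue.getBytes(StandardCharsets.UTF_8));
            // Zero-pad to 32 hex chars (128 bits) so ids are fixed-width.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The same value always yields the same id, on any node of the cluster:
        System.out.println(nodeId("<http://example.org/s>"));
        System.out.println(nodeId("\"a literal\"@en"));
    }
}
```

Because the id is a pure function of the node value, two independently built indexes agree on every id, which is what would make merging TDB indexes directly (as for Lucene indexes) trivial.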
