Amin Astaneh wrote:
Lukáš-
Hi Amin,
I am not familiar with SGE. Could you tell me what you got from this
combination? What is the benefit of running Hadoop on SGE?
Sun Grid Engine is a distributed resource management platform for
supercomputing centers. We use it to allocate resources to a
supercomputing task, such as requesting 32 processors to run a
particular simulation. This mechanism is analogous to the scheduler on a
multi-user OS. What I was able to accomplish was to turn Hadoop into an
as-needed service. When you submit a job request to run Hadoop as the
documentation describes, a cluster configuration specific to that job
request is generated, and a Hadoop cluster is instantiated on however
many nodes were requested. This allows the Hadoop cluster to be deployed
within the context of Grid Engine and to coexist with other simulations
running on the cluster.
A researcher or user who needs to run MapReduce code only has to tell
Hadoop to execute it and decide how many machines should be dedicated to
the task. This makes Hadoop very accessible, since people don't need to
worry about configuring a cluster: SGE and its helper scripts do it for
them.
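
To make the per-job configuration step concrete, here is a rough Python
sketch. It is not the actual helper scripts, just an illustration under
some assumptions: SGE passes the allocated hosts to a parallel job in
the file named by $PE_HOSTFILE (one line per host, host name first), the
first host is used as the master, and the property names are the ones
from the Hadoop releases that read hadoop-site.xml. The function names
and port numbers are made up for the example.

```python
import os
from xml.sax.saxutils import escape


def read_pe_hostfile(path):
    """Return the host names listed in an SGE $PE_HOSTFILE, in order.

    Each line looks like: "<host> <slots> <queue> <processor range>".
    """
    hosts = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields and fields[0] not in hosts:
                hosts.append(fields[0])
    return hosts


def write_cluster_config(hosts, conf_dir, fs_port=9000, jt_port=9001):
    """Write a slaves file and hadoop-site.xml for one job's hosts.

    Assumption: the first host runs the NameNode and JobTracker, and
    every host (including the first) runs one set of slave daemons.
    The port numbers are placeholders, not values from this thread.
    """
    if not hosts:
        raise ValueError("no hosts were allocated to this job")
    master = hosts[0]
    os.makedirs(conf_dir, exist_ok=True)

    # One slave host per line, as the Hadoop start scripts expect.
    with open(os.path.join(conf_dir, "slaves"), "w") as fh:
        fh.write("\n".join(hosts) + "\n")

    # Point every daemon in this job at the job's own master.
    properties = {
        "fs.default.name": "hdfs://%s:%d" % (master, fs_port),
        "mapred.job.tracker": "%s:%d" % (master, jt_port),
    }
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines.append(
            "  <property><name>%s</name><value>%s</value></property>"
            % (escape(name), escape(value))
        )
    lines.append("</configuration>")
    with open(os.path.join(conf_dir, "hadoop-site.xml"), "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return master
```

Inside a job script this would be called roughly as
write_cluster_config(read_pe_hostfile(os.environ["PE_HOSTFILE"]), conf_dir),
with conf_dir somewhere private to the job so that concurrent jobs do
not overwrite each other's configuration.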
As Steve Loughran accurately commented, as of now we can only run one
set of Hadoop slave processes per machine, due to the network binding
issue. That problem is mitigated by configuring SGE to spread the slaves
one per machine automatically to avoid failures.
Only the NameNode and JobTracker need hard-coded/well-known port
numbers; the rest could all be done dynamically.
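
As an illustration of what "dynamically" could mean here, the usual
technique is to bind to port 0 and let the OS pick a free port. The
Python sketch below shows only that general socket technique; it is not
how Hadoop's daemons are implemented, and bind_dynamic_port is a
hypothetical name.

```python
import socket


def bind_dynamic_port(host="127.0.0.1"):
    """Bind a listening TCP socket to a port chosen by the OS.

    Asking for port 0 lets the kernel pick a free ephemeral port, so
    two daemons on one machine cannot collide on a hard-coded number.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen(5)
    # getsockname() reports the port the kernel actually assigned.
    return sock, sock.getsockname()[1]
```

Two calls on the same machine return two different ports. A slave
started this way would still have to report its port to the master,
which is why the master's own ports need to be well known.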
One thing SGE does offer over Xen-hosted images is better performance,
for both CPU and storage: virtualised disk performance can be awful,
and even on the latest x86 parts there is a measurable hit from VM
overheads.