Steve Loughran wrote:
On 11/11/10 11:02, Adarsh Sharma wrote:
Dear all,

Does anyone have any experience working on Hadoop integration with SGE
(Sun Grid Engine)?
It is open-source too (sge-6.2u5).
Does SGE really overcome some of the deficiencies of Hadoop?
According to an article:

That'll be DanT's posting
http://blogs.sun.com/templedf/entry/leading_the_herd


Instead, to set the stage, let's talk about what Hadoop doesn't do so
well. I currently see two important deficiencies in Hadoop: it doesn't
play well with others, and it has no real accounting framework. Pretty
much every customer I've seen running Hadoop does it on a dedicated
cluster. Why? Because the tasktrackers assume they own the machines on
which they run. If there's anything on the cluster other than Hadoop,
it's in direct competition with Hadoop. That wouldn't be such a big deal
if Hadoop clusters didn't tend to be so huge. Folks are dedicating
hundreds, thousands, or even tens of thousands of machines to their
Hadoop applications. That's a lot of hardware to be walled off for a
single purpose. Are those machines really being used? You may not be
able to tell. You can monitor state in the moment, and you can grep
through log files to find out about past usage (Gah!), but there's no
historical accounting capability there.

So I want to know whether it is worthwhile to use SGE with Hadoop in a
production cluster or not.
Please share your views.


A permanently allocated set of machines gives you permanent HDFS storage at the cost of SATA HDDs. Once you go to any on-demand infrastructure you need some persistent store, and it tends to lack locality and have a higher cost/GB, usually because it is SAN-based.

What on-demand infrastructure is good for is sharing physical machines, because unless you can keep the CPU+RAM in your cluster busy, that's wasted CAPEX/OPEX budget.

One thing that's been discussed is to have a physical Hadoop cluster, but have the TaskTrackers' (TTs') capacity reporting work well with other schedulers, via some plugin point:

https://issues.apache.org/jira/browse/MAPREDUCE-1603

This would let your cluster also accept work from other job execution frameworks and, when busy with that work, have the TT report fewer slots to the JobTracker, while still serving up data to the rest of the Hadoop workers.
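
To make the idea concrete, here is a rough sketch of what such a plugin point might look like. To be clear: none of these names exist in Hadoop, and the JIRA above is unimplemented, so this is illustration only. The idea is that the TT would ask an advisor how many slots to advertise in each heartbeat:

// Hypothetical sketch only; nothing here is real Hadoop API.

/** Hypothetical probe of whatever else (e.g. SGE jobs) runs on this host. */
interface ExternalSchedulerProbe {
    /** Fraction of this host's cores busy with non-Hadoop work, in [0.0, 1.0]. */
    double busyCoreFraction();
}

/** Hypothetical plugin the TT would consult before each heartbeat. */
interface SlotCapacityAdvisor {
    int advertisedMapSlots(int configuredMapSlots);
    int advertisedReduceSlots(int configuredReduceSlots);
}

/**
 * Example advisor: withhold slots in proportion to external load, so the
 * JobTracker schedules less work here while SGE is busy, but the DataNode
 * keeps serving HDFS blocks regardless.
 */
class ExternalLoadAdvisor implements SlotCapacityAdvisor {
    private final ExternalSchedulerProbe probe;

    ExternalLoadAdvisor(ExternalSchedulerProbe probe) {
        this.probe = probe;
    }

    public int advertisedMapSlots(int configured) {
        // Clamp the probe's answer, then scale the configured slot count down.
        double busy = Math.min(1.0, Math.max(0.0, probe.busyCoreFraction()));
        return (int) Math.floor(configured * (1.0 - busy));
    }

    public int advertisedReduceSlots(int configured) {
        return advertisedMapSlots(configured);
    }
}

Note the DataNode is untouched in this sketch; that is what would keep HDFS block serving working for the rest of the cluster even while the slots are withheld.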

Benefits:
 -cost of storage is HDFS rates
 -performance of a normal Hadoop cluster
 -under-utilised Hadoop cluster time can be used by other work schedulers, ones that don't need access to the Hadoop storage.

Costs:
 -HDFS security: can you lock it down?
 -your other workloads had better not expect SAN or a low-latency interconnect like InfiniBand, unless you add them to the cluster too, which bumps up costs.

Nobody has implemented this yet, so volunteers to take up their IDE against Hadoop 0.23 would be welcome. And yes, I do mean 0.23; that's the schedule that would work.

-Steve
Thanks a lot, Steve!
This is the kind of explanation that clears up other doubts.

Best Regards
-Adarsh
