Steve Loughran wrote:
On 11/11/10 11:02, Adarsh Sharma wrote:
Dear all,

Does anyone have any experience working on Hadoop integration with SGE
(Sun Grid Engine)?
It is open-source too (sge-6.2u5).
Does SGE really overcome some of the deficiencies of Hadoop?
According to an article:

That'll be DanT's posting
http://blogs.sun.com/templedf/entry/leading_the_herd


Instead, to set the stage, let's talk about what Hadoop doesn't do so
well. I currently see two important deficiencies in Hadoop: it doesn't
play well with others, and it has no real accounting framework. Pretty
much every customer I've seen running Hadoop does it on a dedicated
cluster. Why? Because the tasktrackers assume they own the machines on
which they run. If there's anything on the cluster other than Hadoop,
it's in direct competition with Hadoop. That wouldn't be such a big deal
if Hadoop clusters didn't tend to be so huge. Folks are dedicating
hundreds, thousands, or even tens of thousands of machines to their
Hadoop applications. That's a lot of hardware to be walled off for a
single purpose. Are those machines really being used? You may not be
able to tell. You can monitor state in the moment, and you can grep
through log files to find out about past usage (Gah!), but there's no
historical accounting capability there.

So I want to know whether it is worthwhile to use SGE with Hadoop in a
production cluster or not.
Please share your views.


A permanently allocated set of machines gives you permanent HDFS storage at the cost of SATA HDDs. Once you go to any on-demand infrastructure you need some persistent store, and it tends to lack locality and have a higher cost/GB, usually because it is SAN-based.

What on-demand infrastructure is good for is sharing physical machines, because unless you can keep the CPU+RAM in your cluster busy, that's wasted CAPEX/OPEX budget.

One thing that's been discussed is to have a physical Hadoop cluster, but have the TaskTrackers' (TTs') capacity reporting work well with other schedulers, via some plugin point:

https://issues.apache.org/jira/browse/MAPREDUCE-1603

This would let your cluster also accept work from other job execution frameworks and, when busy with that work, have the TT report fewer slots to the JobTracker, while still serving up data to the rest of the Hadoop workers.
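
To make the idea concrete, here is a rough sketch of what such a plugin point might look like. To be clear: none of these names exist in Hadoop, and the JIRA above is unimplemented, so this is illustration only. The idea is that the TT would ask an advisor how many slots to advertise in each heartbeat:

// Hypothetical sketch only; nothing here is real Hadoop API.

/** Hypothetical probe of whatever else (e.g. SGE jobs) runs on this host. */
interface ExternalSchedulerProbe {
    /** Fraction of this host's cores busy with non-Hadoop work, in [0.0, 1.0]. */
    double busyCoreFraction();
}

/** Hypothetical plugin the TT would consult before each heartbeat. */
interface SlotCapacityAdvisor {
    int advertisedMapSlots(int configuredMapSlots);
    int advertisedReduceSlots(int configuredReduceSlots);
}

/**
 * Example advisor: withhold slots in proportion to external load, so the
 * JobTracker schedules less work here while SGE is busy, but the DataNode
 * keeps serving HDFS blocks regardless.
 */
class ExternalLoadAdvisor implements SlotCapacityAdvisor {
    private final ExternalSchedulerProbe probe;

    ExternalLoadAdvisor(ExternalSchedulerProbe probe) {
        this.probe = probe;
    }

    public int advertisedMapSlots(int configured) {
        // Clamp the probe's answer, then scale the configured slot count down.
        double busy = Math.min(1.0, Math.max(0.0, probe.busyCoreFraction()));
        return (int) Math.floor(configured * (1.0 - busy));
    }

    public int advertisedReduceSlots(int configured) {
        return advertisedMapSlots(configured);
    }
}

Note the DataNode is untouched in this sketch; that is what would keep HDFS block serving working for the rest of the cluster even while the slots are withheld.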

Benefits:
 -cost of storage is HDFS rates
 -performance of a normal Hadoop cluster
 -under-utilised Hadoop cluster time can be used by other work schedulers, ones that don't need access to the Hadoop storage.

Costs:
 -HDFS security: can you lock it down?
 -your other workloads had better not expect SAN or a low-latency interconnect like InfiniBand, unless you add them to the cluster too, which bumps up costs.

Nobody has implemented this yet, so volunteers to take up their IDE against Hadoop 0.23 would be welcome. And yes, I do mean 0.23; that's the schedule that would work.

-Steve
Thanks a lot, Steve!
This is the kind of explanation that clears up other doubts.

Best Regards
-Adarsh
