If the only thing that you're running is Hadoop, it's probably not worth it today. The big win for using Grid Engine with Hadoop is that it lets you consolidate your Hadoop cluster onto the same resources as your other workloads, like MPI, batch, whatever. If all you're doing is Hadoop, then that isn't an issue. The accounting piece is nice, but there are probably other ways to solve that problem that would be less invasive.

The other piece where Grid Engine brings value is the scheduler. Hadoop has a decent scheduler these days, but the scheduler in Grid Engine has had two decades of improvements and tuning put into it. (And I mean actual decades, not man-decades.) If you want things like advance reservation, starvation prevention, fine-grained resource quotas, fine-grained preemption, complex fair-share, deep awareness of heterogeneous resource pools, etc., then Grid Engine might be worth considering, even if all you do is Hadoop.

Incidentally, Grid Engine also helps with the problem of not having redundant JobTrackers. Every Hadoop job running under Grid Engine gets its own JobTracker.

Going forward, Grid Engine will continue to expand its Hadoop support. There are a couple of other "big ticket" issues with running Hadoop in an enterprise IT environment that Grid Engine seems to be well suited to solving.

Just FYI, I am the product manager at Oracle for the Grid Engine product, and I wrote the Hadoop integration as the last thing I did before leaving engineering. What I've written above is my (obviously biased) view of things. I would love to hear feedback from anyone who has either looked at the integration or simply takes issue with anything I said above. I know there are at least two customers out there with Grid Engine managing their Hadoop workloads. I'd love to find others.

Daniel

-------- Original Message --------
Subject: Deficiency in Hadoop
Date: Thu, 11 Nov 2010 16:32:30 +0530
From: Adarsh Sharma <adarsh.sha...@orkash.com>
Reply-To: common-user@hadoop.apache.org
To: common-user@hadoop.apache.org

Dear all,

Does anyone have experience working with the Hadoop integration in SGE
(Sun Grid Engine)?
It is open source too (sge-6.2u5).
Does SGE really overcome some of the deficiencies of Hadoop?
According to an article:

Instead, to set the stage, let's talk about what Hadoop doesn't do so
well. I currently see two important deficiencies in Hadoop: it doesn't
play well with others, and it has no real accounting framework. Pretty
much every customer I've seen running Hadoop does it on a dedicated
cluster. Why? Because the tasktrackers assume they own the machines on
which they run. If there's anything on the cluster other than Hadoop,
it's in direct competition with Hadoop. That wouldn't be such a big deal
if Hadoop clusters didn't tend to be so huge. Folks are dedicating
hundreds, thousands, or even tens of thousands of machines to their
Hadoop applications. That's a lot of hardware to be walled off for a
single purpose. Are those machines really being used? You may not be
able to tell. You can monitor state in the moment, and you can grep
through log files to find out about past usage (Gah!), but there's no
historical accounting capability there.

So I want to know whether it is worthwhile to use SGE with Hadoop in a
production cluster.
Please share your views.

Thanks in Advance
Adarsh Sharma


