Re: hadoop performance with very small cluster

Brian Bockelman Thu, 21 May 2009 06:12:24 -0700


On May 21, 2009, at 2:30 AM, Miles Osborne wrote:

if you mean "hadoop does not give a speed-up compared with a
sequential version" then this is because of overhead associated with
running the framework:  your job will need to be scheduled, JVMs
instantiated, data copied, data sorted etc etc.


Eric,

It depends on your problem. If you have a Java program that's got lots of CPU per key and already is map-reduce-like, you'll probably see pretty good efficiency. If you have a highly optimized assembler program that runs in seconds, you'll probably see poor "efficiency" (however you might be defining that).

Let's say you have N machines and the program takes L seconds on 1 machine. Assume that the overhead is 10 seconds for framework initialization (perhaps conservative?). Then, the total runtime is L/N + 10; the speedup is L/(L/N + 10).

Now, plug in estimates for your cluster. If L->0 or N->infinity, then the dominant term in the expression is the 10 seconds for initialization. So, if N=5 and your original problem took 1 minute, your maximum speedup is about 3. If your initial problem took 1 hour, the maximum speedup is about 5. (Look up Amdahl's law, that's all I'm applying...)
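
To make it easy to plug in your own numbers, here is a minimal Python sketch of the same back-of-the-envelope model. The function name `speedup` and its parameter names are made up for illustration, and the 10-second overhead is only the assumption from above, not a measured Hadoop figure.

```python
def speedup(seq_seconds, machines, overhead_seconds=10.0):
    """Speedup under the simple model above: runtime = L/N + overhead.

    seq_seconds      -- L, runtime of the sequential program on 1 machine
    machines         -- N, number of machines in the cluster
    overhead_seconds -- assumed fixed framework start-up cost (10 s here)
    """
    parallel_runtime = seq_seconds / machines + overhead_seconds
    return seq_seconds / parallel_runtime


# The two cases from the message, both with N = 5.
for label, seq in (("1 minute", 60.0), ("1 hour", 3600.0)):
    print(f"{label}: speedup on 5 machines = {speedup(seq, 5):.2f}")
# prints:
# 1 minute: speedup on 5 machines = 2.73
# 1 hour: speedup on 5 machines = 4.93
```

Under this model the speedup can never exceed L divided by the overhead, so a 1-minute job tops out near 6 no matter how many machines you add.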

So, like the answer to most general questions, the answer is "it depends". For the most part, it depends wholly on whether your problem can be parallelized and on your problem's runtime versus the Hadoop overhead. Even if Hadoop might not provide a huge speedup currently, I'd add to Miles' comment: not only would the solution be easier to maintain, but it would also be easier to grow when you decide you need, say, 100 machines to process your problem.


Brian



if your jobs can be parallelised and you have enough machines (your
cluster is large enough) then the ability to use more machines should
compensate for the framework overhead.

even if your sequential / hacked version running on a small cluster
beats the hadoop version, in my mind a major advantage of Hadoop (and
this is something that people tend to forget) is that your Hadoop
version almost certainly will be simpler and easier to maintain.

Miles

2009/5/21 zhu hui <chinazhuhu...@gmail.com>:

hello, everybody.
i am fresh to hadoop, and i heard from others that hadoop performs not
efficient when the cluster is very small, for example 6 machines.
but i cannot find out the reasons and materials that i can make them as the
proofs.
thanks very much if anybody who can share me with some materials or ideas.
Best Wishes.

Eric.Syu

--
Nothing Impossible




--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
