On May 21, 2009, at 2:30 AM, Miles Osborne wrote:

If you mean "hadoop does not give a speed-up compared with a
sequential version", then this is because of the overhead associated
with running the framework: your job will need to be scheduled, JVMs
instantiated, data copied, data sorted, etc.

Eric,

It depends on your problem. If you have a Java program that does a lot of CPU work per key and is already map-reduce-like, you'll probably see pretty good efficiency. If you have a highly optimized assembler program that runs in seconds, you'll probably see poor "efficiency" (however you might be defining that).

Let's say you have N machines and the program takes L seconds on 1 machine. Assume that the overhead is 10 seconds for framework initialization (perhaps conservative?). Then the total runtime is L/N + 10; the speedup is L/(L/N + 10).

Now, plug in estimates for your cluster. If L -> 0 or N -> infinity, the dominant term in the expression is the 10 seconds of initialization. So, if N = 5 and your original problem took 1 minute, your maximum speedup is about 3. If your initial problem took 1 hour, the maximum speedup is about 5. (Look up Amdahl's law; that's all I'm applying...)
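If it helps, here is that same back-of-the-envelope arithmetic as a tiny Java sketch. It is not Hadoop code, just the model above; the flat 10-second overhead and the class/method names are only illustrative assumptions:

    // Rough speedup model from above: speedup = L / (L/N + overhead)
    public class SpeedupEstimate {

        // l = sequential runtime in seconds, n = number of machines,
        // overhead = fixed framework cost in seconds (assumed 10 s here)
        static double speedup(double l, int n, double overhead) {
            return l / (l / n + overhead);
        }

        public static void main(String[] args) {
            System.out.printf("1 minute job, 5 machines: %.1fx%n", speedup(60, 5, 10));   // ~2.7x
            System.out.printf("1 hour job,   5 machines: %.1fx%n", speedup(3600, 5, 10)); // ~4.9x
        }
    }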

So, like most general questions, the answer is "it depends". For the most part, it depends on whether your problem can be parallelized and on your problem's runtime versus the Hadoop overhead. Even if Hadoop might not provide a huge speedup right now, I'd add to Miles' comment: not only would the solution be easier to maintain, but it would also be easier to grow when you decide you need, say, 100 machines to process your problem.

Brian



If your jobs can be parallelised and you have enough machines (your
cluster is large enough), then the ability to use more machines should
compensate for the framework overhead.

Even if your sequential / hacked version running on a small cluster
beats the Hadoop version, in my mind a major advantage of Hadoop (and
this is something that people tend to forget) is that your Hadoop
version will almost certainly be simpler and easier to maintain.

Miles

2009/5/21 zhu hui <chinazhuhu...@gmail.com>:
Hello, everybody.

I am new to Hadoop, and I have heard from others that Hadoop is not
efficient when the cluster is very small, for example 6 machines.

But I cannot find the reasons or materials that I could use as proof.

Thanks very much to anybody who can share some materials or ideas with me.

Best Wishes.

Eric.Syu

--
Nothing Impossible




