On May 21, 2009, at 2:30 AM, Miles Osborne wrote:
if you mean "hadoop does not give a speed-up compared with a
sequential version" then this is because of overhead associated with
running the framework: your job will need to be scheduled, JVMs
instantiated, data copied, data sorted etc etc.
Eric,
It depends on your problem. If you have a Java program that's got
lots of CPU per key and already is map-reduce-like, you'll probably
see pretty good efficiency. If you have a highly optimized assembler
program that runs in seconds, you'll probably see poor
"efficiency" (however you might be defining that).
Let's say you have N machines and the program takes L seconds on 1
machine. Assume that the overhead is 10 seconds for framework
initialization (perhaps conservative?). Then, the total runtime is
L/N + 10; the speedup is L/(L/N + 10).
Now, plug in estimates for your cluster. If L->0 or N->infinity, then
the dominant term in the expression is the 10 seconds for
initialization. So, if N=5 and your original problem took 1 minute,
your maximum speedup is about 3. If your initial problem took 1 hour,
the maximum speedup is about 5. (Look up Amdahl's law, that's all
I'm applying...)
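
For anyone who wants to plug in their own numbers, here is a minimal sketch of the model above. It assumes perfectly parallel work and a fixed 10-second startup cost, which is only a guess; the function and parameter names are made up for illustration and are not part of Hadoop.

```python
def speedup(seq_seconds, machines, overhead_seconds=10.0):
    """Speedup under the simple model: parallel runtime = L/N + overhead.

    seq_seconds      -- L, runtime of the job on one machine
    machines         -- N, number of machines in the cluster
    overhead_seconds -- assumed fixed framework startup cost (a guess)
    """
    parallel_seconds = seq_seconds / machines + overhead_seconds
    return seq_seconds / parallel_seconds


if __name__ == "__main__":
    n = 5
    for label, seq in [("1 minute", 60.0), ("1 hour", 3600.0)]:
        s = speedup(seq, n)
        # Efficiency here means speedup divided by machine count.
        print(f"{label} job on {n} machines: "
              f"speedup {s:.2f}, efficiency {s / n:.0%}")
```

With N=5 this gives roughly 2.7 for the 1-minute job and 4.9 for the 1-hour job, matching the estimates above. Note that as N grows the speedup approaches L/10, so for short jobs extra machines stop helping quickly.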
So, like the answer to most general questions, the answer is "it
depends". For the most part, it depends on whether your problem can
be parallelized and on how its runtime compares with the Hadoop
overhead. Even if Hadoop might not provide a huge speedup currently,
I'd add to Miles' comment: not only would the solution be easier to
maintain, but it would also be easier to grow when you decide you
need, say, 100 machines to process your problem.
Brian
if your jobs can be parallelised and you have enough machines (your
cluster is large enough) then the ability to use more machines should
compensate for the framework overhead.
even if your sequential / hacked version running on a small cluster
beats the hadoop version, in my mind a major advantage of Hadoop (and
this is something that people tend to forget) is that your Hadoop
version almost certainly will be simpler and easier to maintain.
Miles
2009/5/21 zhu hui <chinazhuhu...@gmail.com>:
hello, everybody.
i am new to hadoop, and i heard from others that hadoop does not
perform efficiently when the cluster is very small, for example 6
machines. but i cannot find the reasons, or any materials that i can
use as proof.
thanks very much to anybody who can share some materials or ideas.
Best Wishes.
Eric.Syu
--
Nothing Impossible