I've been playing with a 6-node cluster to test the feasibility of using
Hadoop to deal with a large data set (7 TB) we have been struggling to
comprehend. I have understood all along that I would need a much
larger cluster if we decided to go ahead with Hadoop, but I must say that
my initial reaction was that there was a lot of communication overhead.
I suspect the change you have described would have led to a much
different first impression.
Doug Cutting wrote:
Currently, pseudo-distributed mode is *much* slower than "local" mode.
It makes sense that running a trivial task on 100 nodes might take
longer than running it standalone, but running it on one node over
localhost should not be that much slower. In part this is due to task
JVM startup time, but I think the larger part of the blame lies with
heartbeat intervals.
The tasktracker polls for new tasks only every heartbeat interval. When
running small jobs in small clusters, this interval dominates
performance. But in larger clusters a short heartbeat interval would
overload the jobtracker. Perhaps the tasktracker should instead get its
heartbeat interval from the jobtracker. The jobtracker could return a
small interval when few tasktrackers are known, and a larger interval
when lots of tasktrackers are known. This would make small clusters
more responsive.
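A minimal sketch of the idea: the jobtracker scales the interval it hands
back with the number of tasktrackers it currently knows about, clamped
between a floor and a ceiling. Everything here is illustrative, not actual
Hadoop API; the class name, the bounds, and the per-second heartbeat budget
are all assumptions.

```java
// Hypothetical sketch: jobtracker-chosen heartbeat interval that grows
// with cluster size. Names and constants are illustrative assumptions,
// not real Hadoop code.
public class ClusterHeartbeat {
    // Assumed bounds: poll every 300 ms in a tiny cluster, cap at 10 s.
    static final int MIN_INTERVAL_MS = 300;
    static final int MAX_INTERVAL_MS = 10_000;
    // Assumed budget: the jobtracker can absorb ~100 heartbeats per second.
    static final int HEARTBEATS_PER_SECOND = 100;

    /** Interval the jobtracker would return with each heartbeat reply. */
    static int intervalFor(int knownTaskTrackers) {
        // Spread the trackers' heartbeats across the per-second budget,
        // then clamp to the configured floor and ceiling.
        int interval = knownTaskTrackers * 1000 / HEARTBEATS_PER_SECOND;
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, interval));
    }

    public static void main(String[] args) {
        System.out.println(intervalFor(1));     // pseudo-distributed: floor, 300 ms
        System.out.println(intervalFor(100));   // mid-size cluster: 1000 ms
        System.out.println(intervalFor(5000));  // large cluster: capped at 10 s
    }
}
```

With something like this, a pseudo-distributed setup would poll at the
floor and feel responsive, while a large cluster would back off toward the
ceiling and keep jobtracker load roughly constant.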
One could use a similar mechanism in dfs.
This is a very low priority issue that I just wanted to get out of my head.
Doug