Currently, pseudo-distributed mode is *much* slower than "local" mode. It makes sense that running a trivial task on 100 nodes might take longer than running it standalone, but running it on one node over localhost should not be that much slower. This is partly due to task JVM startup time, but I think heartbeat intervals deserve the larger part of the blame.

The tasktracker polls for new tasks only every heartbeat interval. When running small jobs in small clusters, this interval dominates performance. But in larger clusters a short heartbeat interval would overload the jobtracker. Perhaps the tasktracker should instead get its heartbeat interval from the jobtracker. The jobtracker could return a small interval when few tasktrackers are known, and a larger interval when lots of tasktrackers are known. This would make small clusters more responsive.
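
To make the idea concrete, here is a rough sketch of how the jobtracker might pick the interval it hands back with each heartbeat response. Everything in it is an assumption for illustration: the class name, the method name, and the three constants are hypothetical and do not come from the existing code. The interval grows linearly with the number of known tasktrackers, so the total heartbeat load on the jobtracker stays near a fixed budget, and it is clamped at both ends.

```java
// Hypothetical sketch: none of these names or values come from the Hadoop code base.
final class HeartbeatIntervalSketch {
    // Assumed floor, so a lone tasktracker does not hammer the jobtracker.
    static final long MIN_INTERVAL_MS = 100;
    // Assumed ceiling, so very large clusters still report in reasonably often.
    static final long MAX_INTERVAL_MS = 10_000;
    // Assumed number of heartbeats per second the jobtracker can comfortably absorb.
    static final long TARGET_HEARTBEATS_PER_SECOND = 100;

    // Interval the jobtracker would return to a tasktracker in its heartbeat response.
    static long nextIntervalMs(int knownTaskTrackers) {
        long trackers = Math.max(1, knownTaskTrackers);
        // With T trackers and a budget of N heartbeats per second,
        // each tracker may report once every T / N seconds.
        long scaled = trackers * 1000L / TARGET_HEARTBEATS_PER_SECOND;
        return Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, scaled));
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 100, 1000, 5000}) {
            System.out.println(n + " tasktrackers -> " + nextIntervalMs(n) + " ms");
        }
    }
}
```

The tasktracker would then sleep for whatever interval the last response carried, rather than reading a fixed value from its own configuration. The actual floor, ceiling, and budget would need to be measured; the numbers above are placeholders.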

A similar mechanism could be used in DFS.

This is a very low priority issue that I just wanted to get out of my head.

Doug
