Hi Praveen,

I have not yet run any experiments with multiple racks to support my claim, although I intend to. But it seems logical that, in pursuit of data locality, there is a good chance that some of our tasks will get scheduled on different racks. That helps reduce the time needed to read all the input from the input file, but Hama's performance also depends on how fast the BSP peers can transfer messages to each other in subsequent supersteps. Unlike a mapper task, which completes after reading all of its input records, a BSP task keeps running on the machine where it started. So, for example, if the job requires 1000 supersteps, data locality would let the first superstep complete in the shortest time, but the remaining 999 supersteps would be slowed down by the increased delay in transferring messages. Hence my opinion that topology information should be given higher priority than data locality. I know I should back every claim with data, and I will do so soon with my VM setup. :)
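To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the cost model (a one-time input read plus a fixed message-exchange cost per superstep), the function name `total_runtime_ms`, and every number in it are assumptions I made up, not measurements from a Hama cluster.

```python
def total_runtime_ms(supersteps, input_read_ms, msg_exchange_ms):
    """Assumed cost model: input is read once, messages are exchanged every superstep."""
    return input_read_ms + supersteps * msg_exchange_ms


# All values below are hypothetical, chosen only to illustrate the argument.
SUPERSTEPS = 1000

# Data-local placement: fast input read, but tasks end up on different racks,
# so every superstep pays a higher message-transfer cost.
data_local = total_runtime_ms(SUPERSTEPS, input_read_ms=10_000, msg_exchange_ms=500)

# Topology-aware placement: slower (remote) input read, but tasks share a rack,
# so message transfer per superstep is cheaper.
rack_local = total_runtime_ms(SUPERSTEPS, input_read_ms=30_000, msg_exchange_ms=200)

print("data-local placement:", data_local, "ms")
print("rack-local placement:", rack_local, "ms")
```

Under these made-up numbers the one-time saving on the input read is outweighed by the per-superstep messaging cost. With only a handful of supersteps the balance could easily tip the other way, which is why real measurements are still needed.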
Also, in the meantime, I came across this: https://issues.apache.org/jira/browse/HDFS-385
We should keep this in mind for pure Hama clusters.

-Suraj

On Mon, Apr 2, 2012 at 12:17 PM, Praveen Sripati <[email protected]> wrote:
>
> https://issues.apache.org/jira/browse/HAMA-543
>
> While working on it, I realized that this won't necessarily improve the
> performance, because the resource requirements for Hama is different from
> Hadoop. This change would move the mapper tasks closer to the input as in
> Hadoop. But in case of Hama tasks continue running on that machine
> throughout its lifetime. If in search of data-locality, the tasks get
> scheduled such that the communication between the nodes are costlier than
> normal (e.g. tasks resident in separate racks), then this change would
> degrade the performance.
>
> Doesn't data locality improve the performance of Hama?
>
> Praveen
