Re: Hama and Data Locality

Suraj Menon Mon, 02 Apr 2012 09:39:51 -0700

Hi Praveen,

I have not yet run any experiments with multiple racks to support my claim,
although I intend to.
But it seems logical that, in pursuit of data locality, there is a good
chance that a few of our tasks get scheduled on different racks.
This helps reduce the time to read the inputs from the input file, but
Hama performance also depends on how fast the BSP peers can transfer
messages to each other in the subsequent supersteps.
Unlike a mapper task, which completes after reading all its input records,
a BSP task keeps running on the machine where it started. So, for
example, if the job requires 1000 supersteps, data locality would let the
first superstep complete in the shortest time but would slow down the
remaining 999 supersteps because of the increased delay in transferring
messages. Hence my opinion that topological information should be given
higher priority than data-locality needs. I know I should back
every claim with data; I will do that soon with my VM setup. :)
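
To make the trade-off above concrete, here is a rough back-of-the-envelope
sketch in Python. The cost model (input is read once, every superstep pays
compute plus message exchange) and every number in it are assumptions chosen
for illustration only; they are not measurements from a Hama cluster, and
the function names are hypothetical.

```python
def total_job_time_ms(input_read_ms, compute_ms, comm_ms, supersteps):
    """Assumed cost model: read the input once, then pay compute plus
    message-exchange cost in every superstep."""
    return input_read_ms + supersteps * (compute_ms + comm_ms)


def break_even_supersteps(extra_read_ms, comm_saving_ms):
    """Number of supersteps after which cheaper messaging has paid back
    the slower (non-local) input read."""
    return extra_read_ms / comm_saving_ms


# Hypothetical placement A: data-local tasks spread across racks.
# Fast input read, slower cross-rack messaging in every superstep.
locality_first = total_job_time_ms(
    input_read_ms=10_000, compute_ms=1_000, comm_ms=500, supersteps=1000)

# Hypothetical placement B: tasks packed by topology (same rack).
# Slower remote input read, faster messaging in every superstep.
topology_first = total_job_time_ms(
    input_read_ms=30_000, compute_ms=1_000, comm_ms=200, supersteps=1000)

# Supersteps needed before placement B comes out ahead.
break_even = break_even_supersteps(
    extra_read_ms=30_000 - 10_000, comm_saving_ms=500 - 200)

print(locality_first, topology_first, break_even)
```

With these made-up numbers the topology-aware placement wins for any job
longer than about 67 supersteps. For a job with only a handful of
supersteps, the data-local placement would still be the better choice.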


Also in the meantime, I came across this -
https://issues.apache.org/jira/browse/HDFS-385
We should keep this in mind for pure Hama clusters.

-Suraj

On Mon, Apr 2, 2012 at 12:17 PM, Praveen Sripati
<[email protected]> wrote:

> > https://issues.apache.org/jira/browse/HAMA-543
>
> > While working on it, I realized that this won't necessarily improve the
> > performance, because the resource requirements for Hama are different
> > from Hadoop. This change would move the tasks closer to the input, as
> > with mapper tasks in Hadoop. But in the case of Hama, tasks continue
> > running on the same machine throughout their lifetime. If, in search of
> > data locality, the tasks get scheduled such that the communication
> > between the nodes is costlier than normal (e.g. tasks resident in
> > separate racks), then this change would degrade the performance.
>
> Doesn't data locality improve the performance of Hama?
>
> Praveen
>
