If you look into FileInputFormat, you'll see that there's a call to
FileSystem.getFileBlockLocations() (line 222) that finds the addresses of
the nodes holding the blocks to be mapped. Each FileSplit generated in that
same getSplits() method carries the list of hosts where the split's data
lives. Hadoop's scheduler uses these locations only as a *hint* for where
to run the task. If the TaskTrackers on those nodes all stay full for a
prolonged period of time, the scheduler may run the task corresponding to
that split on a different node.
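
For reference, here's a minimal sketch of that pattern against the old
mapred API; it skips FileInputFormat's real split-size logic and just makes
one split per block (the class and method names here are only illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;

public class SplitLocationSketch {
  // One FileSplit per HDFS block, carrying that block's hosts as the hint.
  static List<FileSplit> splitsFor(FileSystem fs, Path path)
      throws IOException {
    FileStatus stat = fs.getFileStatus(path);
    // Ask the NameNode where each block of this file physically lives.
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (BlockLocation block : blocks) {
      // The hosts array is a scheduling hint only, not a constraint.
      splits.add(new FileSplit(path, block.getOffset(), block.getLength(),
                               block.getHosts()));
    }
    return splits;
  }
}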

In general, you do not have precise control over which nodes are used for a
given job. This is a key part of Hadoop's reliability: the system is less
brittle because it does not depend on the availability of any particular
node. If you pegged your job to specific nodes, it would probably become
less reliable as a result. So Hadoop doesn't really provide a convenient
way for you to do this.

- Aaron

On Wed, Jul 8, 2009 at 1:08 PM, Sean Arietta <sarie...@virginia.edu> wrote:

>
> I have a rather specific question. Hopefully someone can help me get to the
> bottom of this.
>
> I need to be able to run a piece of code on an arbitrary number of physical
> nodes. My initial thought was that I could trick the Hadoop API into
> executing code on these nodes by emitting splits via a FileInputFormat
> whose hosts were set to different physical nodes. This would look something
> like:
>
> nodes = getNodeList()
> foreach node in nodes:
>     create split with hosts = node.getHostName()
>
> Now for my questions. The first is whether this will actually enforce my
> goal. I know that Hadoop attempts to move code to where the data is, but is
> this enough to trick the API, or does it actually do something more clever,
> like inspect the file that is passed along with the split?
>
> Second, I cannot even succeed in performing the above task because I cannot
> seem to find a way to get a list of the current live data nodes. I have been
> all through the API and I have found some methods that can access this
> information, but I cannot access those methods. Specifically, I found
> FSNamesystem (which I cannot access) and JspHelper, which complains of a
> null pointer reference when I attempt to call its default constructor. So,
> the second question is: does anyone know a way to get a list of the live
> data nodes from within a MapReduce program?
>
> Thanks a lot for your help!
>
> Cheers,
> Sean
