There is currently no API for hinting or selecting which DataNodes (DNs) to
read from, so you may need to modify the client code yourself (a contribution
of such a feature is welcome; please file a JIRA with a proposal).

You could perhaps hook in your own control over which replica location the
reader selects for a given block, inside the non-public method
DFSInputStream#getBestNodeDNAddrPair(…):
https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L982-L1021
(be sure to preserve the existing logic around edge cases, however).
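
As a rough illustration of the kind of selection policy one might plug in at
that point, here is a minimal sketch in plain Java. It is not Hadoop code and
uses no Hadoop types: the class ReplicaChooser, its methods, and the host
names are all hypothetical, and hosts are plain strings rather than
DatanodeInfo objects. It assumes the edge cases worth preserving are skipping
nodes the reader has excluded and returning nothing when no candidate remains;
check the linked source for the actual behaviour before relying on that.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper (not part of Hadoop): prefers the replica host with the fewest reads assigned so far. */
final class ReplicaChooser {
    // Number of reads this client has already dispatched to each host.
    private final Map<String, Integer> assignedReads = new HashMap<>();

    /**
     * @param replicaHosts hosts holding the block, in the order they were reported
     * @param excluded     hosts to skip (a stand-in for the reader's dead/ignored node sets)
     * @return the chosen host, or empty if no usable replica remains
     */
    Optional<String> choose(List<String> replicaHosts, Set<String> excluded) {
        String best = null;
        int bestLoad = Integer.MAX_VALUE;
        if (replicaHosts != null) {
            for (String host : replicaHosts) {
                if (excluded != null && excluded.contains(host)) {
                    continue; // never pick an excluded node
                }
                int load = assignedReads.getOrDefault(host, 0);
                // Strict '<' keeps the reported ordering as the tie-breaker.
                if (load < bestLoad) {
                    best = host;
                    bestLoad = load;
                }
            }
        }
        if (best == null) {
            return Optional.empty(); // caller decides how to handle "no live replica"
        }
        assignedReads.merge(best, 1, Integer::sum);
        return Optional.of(best);
    }

    /** Reads assigned to the given host so far. */
    int assignedTo(String host) {
        return assignedReads.getOrDefault(host, 0);
    }

    public static void main(String[] args) {
        ReplicaChooser chooser = new ReplicaChooser();
        List<String> hosts = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> excluded = Collections.singleton("dn3");
        for (int block = 0; block < 3; block++) {
            System.out.println("block " + block + " -> "
                + chooser.choose(hosts, excluded).orElse("no usable replica"));
        }
    }
}
```

With dn3 excluded, the three picks in main alternate between the remaining
hosts (dn1, dn2, dn1). Note that this only balances the reads issued by a
single client; it knows nothing about load that other clients place on the
same DataNodes.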

Note, though, that the list of block replica locations the NameNode returns
for off-rack reads is randomized by default. Are you observing a
non-random distribution of reads?

On Wed, 19 Jul 2017 at 06:16 Shivram Mani <sm...@pivotal.io> wrote:

> We have an application which uses the DFSInputStream to read blocks from a
> *remote* hadoop cluster. Is there any way we can influence which specific
> datanode the block fetch request is dispatched to?
>
> The reasoning behind this is that, since our application workload is very
> heavy on IO, we would like to distribute the IO load as evenly as possible
> across the hosts/disks. Hence prior to reading data, we wish to obtain the
> location of the underlying blocks and build a dispatch plan so as to
> maximize the IO throughput on the HDFS cluster.
>
> How do we go about this?
>
>
> --
> Thanks
> Shivram
>
