There is currently no public API to hint or select which DataNodes (DNs) to read from, so you may need to modify the client code yourself (a contribution of such a feature is welcome; please file a JIRA with a proposal).
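If you do end up patching the client, the selection policy itself can be fairly simple. Below is a minimal sketch of a greedy "least-loaded replica" dispatch plan in plain Java. It is only an illustration: the class name ReplicaDispatchPlan and the method plan are hypothetical and not part of Hadoop, and it assumes you have already collected the replica hosts and length of each block (for example via the public FileSystem#getFileBlockLocations and BlockLocation#getHosts). The stock client will not honor such a plan on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: builds a read plan that spreads
// bytes as evenly as a simple greedy pass allows.
final class ReplicaDispatchPlan {

    /**
     * For each block, in order, picks the replica host that has the fewest
     * bytes assigned to it so far. Ties go to the host listed first.
     *
     * @param replicaHosts replicaHosts.get(i) lists the hosts holding block i
     * @param blockSizes   blockSizes[i] is the length of block i in bytes
     * @return the chosen host for each block, in block order
     */
    static List<String> plan(List<List<String>> replicaHosts, long[] blockSizes) {
        if (replicaHosts.size() != blockSizes.length) {
            throw new IllegalArgumentException("one size is required per block");
        }
        Map<String, Long> assignedBytes = new HashMap<>();
        List<String> chosen = new ArrayList<>();
        for (int i = 0; i < replicaHosts.size(); i++) {
            String best = null;
            long bestLoad = Long.MAX_VALUE;
            for (String host : replicaHosts.get(i)) {
                long load = assignedBytes.getOrDefault(host, 0L);
                if (best == null || load < bestLoad) {
                    best = host;
                    bestLoad = load;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("block " + i + " has no replicas");
            }
            // Record the bytes this read will pull from the chosen host.
            assignedBytes.merge(best, blockSizes[i], Long::sum);
            chosen.add(best);
        }
        return chosen;
    }
}
```

A greedy pass like this ignores disks, rack distance and load from other clients, so treat it as a starting point for the policy rather than a complete one.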
You can perhaps hook in your own control over which replica location the reader selects for a given block in the non-public method DFSInputStream#getBestNodeDNAddrPair(…):
https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L982-L1021
(take care to preserve the existing logic around edge cases, however).

Note though that the block replica location list returned by the NameNode for off-rack reads is randomized by default. Are you observing a non-random distribution of reads?

On Wed, 19 Jul 2017 at 06:16 Shivram Mani <sm...@pivotal.io> wrote:

> We have an application which uses the DFSInputStream to read blocks from a
> *remote* hadoop cluster. Is there any way we can influence which specific
> datanode the block fetch request is dispatched to ?
>
> The reasoning behind this is since our application workload is very heavy
> on IO , we would like to distribute the IO load as evenly as possible
> across the hosts/disks. Hence prior to reading data, we wish to obtain the
> location of the underlying blocks and build a dispatch plan so as to
> maximize the IO throughput on the HDFS cluster.
>
> How do we go about this ?
>
> --
> Thanks
> Shivram