Re: read from a hdfs file on the same host as client

2014-10-13 Thread Shivram Mani
Demai, you are right. HDFS's default placement policy, BlockPlacementPolicyDefault,
puts the first replica of each block on the writer's own datanode, provided the
writer is running on a datanode.
Replica selection for reads also aims to minimize bandwidth and latency: when
the reader's node holds a replica, the block is served from that local node.
If you want to optimize this further, you can set 'dfs.client.read.shortcircuit'
to true. This allows the client to bypass the datanode and read the block
files directly from local disk.
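
For reference, a minimal sketch of the hdfs-site.xml entries involved, assuming
a Hadoop 2.x cluster with the native libhadoop library available. Short-circuit
reads also need a UNIX domain socket path, set through 'dfs.domain.socket.path'.
The socket path below is only an example value, and the same settings would
normally go on both the datanode and the client:

```xml
<!-- hdfs-site.xml (datanode and client); a sketch, not a complete config -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <!-- example path only; choose a location that suits your installation -->
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

The datanode has to be restarted to pick up the socket path. If the native
library or the socket is missing, the client should fall back to the normal
read path through the datanode.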

On Mon, Oct 13, 2014 at 11:58 AM, Demai Ni nid...@gmail.com wrote:

 hi, folks,

 a very simple question; looking forward to a couple of pointers.

 Let's say I have an hdfs file, testfile, which has only one block (256MB),
 and the block has a replica on datanode host1.hdfs.com (the whole hdfs
 may have 100 nodes though, and the other 2 replicas are on other
 datanodes).

 If, on host1.hdfs.com, I do a hadoop fs -cat testfile or use a java client
 to read the file, should I assume there won't be any significant data
 movement through the network? That is, is the namenode smart enough to give
 me the data on host1.hdfs.com directly?

 thanks

 Demai




-- 
Thanks
Shivram


Re: read from a hdfs file on the same host as client

2014-10-13 Thread Demai Ni
Shivram,

many thanks for confirming the behavior. I will also turn on
short-circuit reads as you suggested. I appreciate the help.

Demai

On Mon, Oct 13, 2014 at 3:42 PM, Shivram Mani sm...@pivotal.io wrote:

 Demai, you are right. HDFS's default BlockPlacementPolicyDefault makes
 sure one replica of your block is available on the writer's datanode.
 The replica selection for the read operation is also aimed at minimizing
 bandwidth/latency and will serve the block from the reader's local node.
 If you want to further optimize this, you can set 'dfs.client.read.shortcircuit'
 to true. This would allow the client to bypass the datanode to read the
 file directly.