Building on Bharath's responses, my answers to each of your questions are inline:
On Wed, Jan 23, 2013 at 2:54 PM, Dibyendu Karmakar <dibyendu.d...@gmail.com> wrote:
> Hi,
> I am doing some performance testing in HADOOP. But while testing, I faced
> a situation. I need your help.
>
> My HADOOP cluster :
> 6 Datanodes, 1 Namenode, 2 Clients.
>
> Replication factor = 3
>
> 2 clients write a file(put operation) whose size is 2 x block size.
> DFS.DATA.DIR in each Datanodes is equal and is same as block size. That
> means each Datanodes stores a single block.

This isn't strictly correct in theory (block placement has a random element and also depends on the client's location), but let's assume a balanced state for the sake of the question.

> Now, if 2 clients simultaneously reads the file( get operation),
> Will 2 clients read 2 blocks from different Datanodes ?
> Or they will read from the same datanodes?

It depends on where the client is located. If it runs on one of the DNs, it does a local read. If it runs elsewhere, each client may read from a different DN or even from the same DN (the NN returns the replica locations in random order). Ideally the DN closest to the client is picked, at least rack-wise, if the NN is rack-aware.

> Does Namenode know which Datanode is busy and which one is idle?

The NN does health checks at write time (things like free space, load and recent availability). At read time, the client does more of a failover: it tries the DNs one at a time, in the order provided, until one accepts its request. This matters if they are all very busy.

> What I am trying to find is that...
> Is it possible to decrease the read time by increasing replication factor?

Yes, more replicas generally mean more DNs available to serve a read, but they also slow down writes, as there is more synchronous waiting involved.

> I have attached an image to better understand my question. Kindly take a
> look. Thank you. And if possible please give references.

"Will access be distributed [for a series of block of the same file]?" - Yes, for remote client reads.
Access order is randomized for this kind of client, so the read pattern can differ each time.

--
Harsh J
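
P.S. In case it helps to see the read-side behaviour described above, here is a rough toy model in Python. It is not HDFS code and uses no Hadoop API: the function names (`replica_order`, `read_block`, `simulate`), the DN names and the "busy" set are all made up for illustration. It only assumes what is said above: a local replica is tried first, remote replicas come back in random order, and the client fails over to the next DN when one refuses.

```python
import random
from collections import Counter


def replica_order(replicas, client_node=None, rng=random):
    """Order in which a client would try the replicas of one block.

    Assumption (toy model): a replica on the client's own node comes
    first, the remaining replicas follow in random order.
    """
    local = [dn for dn in replicas if dn == client_node]
    remote = [dn for dn in replicas if dn != client_node]
    rng.shuffle(remote)
    return local + remote


def read_block(replicas, busy, client_node=None, rng=random):
    """Try DNs one at a time, in order, until one accepts the request.

    `busy` is a hypothetical set of DNs that refuse the read.
    Returns the DN that served the read, or None if all refused.
    """
    for dn in replica_order(replicas, client_node, rng):
        if dn not in busy:
            return dn
    return None


def simulate(num_datanodes, replication, num_reads, seed=0):
    """Count how many remote reads of a single block land on each DN."""
    rng = random.Random(seed)
    datanodes = [f"dn{i}" for i in range(num_datanodes)]
    replicas = rng.sample(datanodes, replication)  # random placement
    hits = Counter()
    for _ in range(num_reads):
        hits[read_block(replicas, busy=set(), rng=rng)] += 1
    return hits


if __name__ == "__main__":
    for r in (1, 3, 6):
        hits = simulate(num_datanodes=6, replication=r, num_reads=1200)
        print(f"replication={r}: busiest DN served {max(hits.values())} reads")
```

Running it for replication factors 1, 3 and 6 shows the per-DN read count dropping as replicas are added, which is the effect behind the "yes" above. It says nothing about the write-side cost, which this model ignores.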