Re: Merging of files back in hadoop

Ayon Sinha Tue, 19 Apr 2011 08:54:03 -0700

One thing to note is that the HDFS client code fetches the blocks directly from 
the datanodes after obtaining the location info from the namenode. That way the 
namenode does not become the bottleneck for all data transfers. The clients only 
get the information about the sequence and location of the blocks from the 
namenode, like Bobby mentioned. 
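
To make the split between metadata and data concrete, here is a toy model in 
plain Java. It does not use the Hadoop API at all; ToyClusterSketch, BlockRef, 
addBlock, locate and read are made-up names for illustration only. The 
"namenode" map holds just the ordered block list and where each block lives, 
and the bytes come straight from the "datanode" maps.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types exist in Hadoop.
class ToyClusterSketch {
    /** Metadata for one block: its id and the host that stores it. */
    static final class BlockRef {
        final long id;
        final String host;
        BlockRef(long id, String host) { this.id = id; this.host = host; }
    }

    // Namenode stand-in: path -> blocks in file order. It holds no file bytes.
    private final Map<String, List<BlockRef>> nameNode = new HashMap<>();
    // Datanode stand-ins: host -> (block id -> block bytes).
    private final Map<String, Map<Long, byte[]>> dataNodes = new HashMap<>();

    /** Appends a block to a file: metadata to the namenode, bytes to a datanode. */
    void addBlock(String path, long id, String host, String data) {
        nameNode.computeIfAbsent(path, p -> new ArrayList<>()).add(new BlockRef(id, host));
        dataNodes.computeIfAbsent(host, h -> new HashMap<>())
                 .put(id, data.getBytes(StandardCharsets.UTF_8));
    }

    /** The only question the client asks the namenode: which blocks, in what order, where. */
    List<BlockRef> locate(String path) {
        return nameNode.get(path);
    }

    /** Client-side read: follow the namenode's order, fetch bytes from each datanode. */
    String read(String path) {
        // Collected into one string only because the demo data is tiny.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (BlockRef ref : locate(path)) {
            byte[] bytes = dataNodes.get(ref.host).get(ref.id);
            out.write(bytes, 0, bytes.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

In this sketch the file contents never pass through the nameNode map, which is 
the point: the namenode only answers the "which blocks, in what order, on which 
hosts" question.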
 -Ayon

________________________________
From: Robert Evans <ev...@yahoo-inc.com>
To: "hdfs-user@hadoop.apache.org" <hdfs-user@hadoop.apache.org>
Sent: Tue, April 19, 2011 6:37:13 AM
Subject: Re: Merging of files back in hadoop

Shreya,

The metadata is all stored in the name node.  It stores where all of the blocks 
are located and the order of the blocks in a file.  Data is merged as needed 
behind the scenes when you call methods on the instance of java.io.InputStream 
returned by open.  So, when you open a file for reading you are making a 
connection to one of the machines that has a copy of the first block of the 
file.  As you read the data and finish with the first block, the second block 
is fetched for you from whichever machine has a copy of it, and you continue 
until all blocks are read.  Typically in map/reduce, each mapper that is reading 
data will read one block, and possibly a little bit more from the start of the 
next block.  That way you never have all of the file in memory on any machine. 
Typically mappers only process a small part of the block at a time, one 
key/value pair.  However, there is nothing stopping you from doing something 
bad, such as trying to cache the entire contents of the file in memory as you 
read it from the stream, except that you would eventually get an 
OutOfMemoryError.
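
The read path described above can be sketched in plain Java without any Hadoop 
dependency. This is only a model, not how the HDFS client is implemented: 
BlockStreamSketch, open and countLines are made-up names, and in-memory byte 
arrays stand in for blocks on remote machines. java.io.SequenceInputStream 
asks for the next block only once the current one is exhausted, and the reader 
uses a small fixed buffer, so memory use does not grow with file size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

// Toy model only: byte arrays stand in for blocks held on datanodes.
class BlockStreamSketch {
    /** Presents the ordered blocks as one stream, opening each block only when reached. */
    static InputStream open(List<byte[]> blocksInOrder) {
        Iterator<byte[]> it = blocksInOrder.iterator();
        Enumeration<InputStream> lazyBlocks = new Enumeration<InputStream>() {
            @Override public boolean hasMoreElements() { return it.hasNext(); }
            // Stand-in for "connect to a machine that has the next block".
            @Override public InputStream nextElement() {
                return new ByteArrayInputStream(it.next());
            }
        };
        return new SequenceInputStream(lazyBlocks);
    }

    /** Counts newline-terminated records while holding at most 8 bytes at a time. */
    static long countLines(InputStream in) throws IOException {
        byte[] buf = new byte[8];
        long lines = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    lines++;
                }
            }
        }
        return lines;
    }
}
```

If the blocks are "a\nb", "c\nd\n" and "e\n", the record "bc" straddles the 
first block boundary, yet the reader sees it as an ordinary line. That is the 
same reason a mapper may need to read a little past the end of its own block.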

--
Bobby Evans

On 4/19/11 4:19 AM, "Shreya Chakravarty" <shreya_chakrava...@persistent.co.in> 
wrote:

>Hi,
> 
>I have a query regarding how Hadoop merges the data back which has been split 
>into blocks and stored on different nodes.
>·        Where is the data merged, given that the file can be so huge that it 
>doesn’t fit onto one machine?
>
>·        Where is the sequence maintained for merging it back?
>
> 
>
>Thanks and Regards,
>Shreya Chakravarty
> 
