RE: Does FileSplit respect the record boundary?

Vinayakumar B Fri, 10 Feb 2012 23:20:53 -0800

Hi Zhu,

* The LineRecordReader gets the path in HDFS itself, not on the
local file system.

However, it is the NameNode that returns the list of DataNodes for a
particular block, sorted by distance from the client, i.e. the machine
where the task is running. A local replica, if one exists, is therefore
tried first.

* For a line that ends in the next block, HDFS itself takes care of
getting the next block's information from the NameNode and handing the data
to the LineReader. The LineReader just continues reading without worrying
about the location of the block.

  - One suggestion for better performance is to set the split size for
the job equal to the block size of the input file. If the split size is
larger than the block size, a task may need to fetch block data from
multiple DataNodes.
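
To make the split-size suggestion concrete, here is a small sketch in plain Java (no Hadoop classes) that counts how many blocks a split's byte range overlaps. The class name, the method name and the 64 MB block size are assumptions chosen for illustration; they are not taken from the Hadoop source.

```java
public class SplitSizeSketch {

    // Number of distinct blocks overlapped by the byte range
    // [offset, offset + length) of a file, for a fixed block size.
    // Hypothetical helper; length must be greater than zero.
    public static long blocksTouched(long offset, long length, long blockSize) {
        long firstBlock = offset / blockSize;
        long lastBlock = (offset + length - 1) / blockSize;
        return lastBlock - firstBlock + 1;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // assumed block size of 64 MB

        // Split size equal to the block size: one block per split.
        System.out.println(blocksTouched(0, block, block));     // 1

        // Split size twice the block size: the task reads two blocks,
        // which may live on different DataNodes.
        System.out.println(blocksTouched(0, 2 * block, block)); // 2
    }
}
```

Even when the split size equals the block size, the last line of a split can spill a few bytes into the next block, so an occasional short remote read is still possible.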

Thanks and Regards,

Vinayakumar B

From: GUOJUN Zhu [mailto:guojun_...@freddiemac.com] 
Sent: Saturday, February 11, 2012 3:50 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Does FileSplit respect the record boundary?

Thank you for the reply.  That page helps a lot.  I still have a more
specific question.  In a LineRecordReader's constructor (Hadoop 1.0.0),
public LineRecordReader(Configuration job, FileSplit split), does the call
"final Path file = split.getPath()" return the logical file in HDFS or just
the real local file corresponding to the block in the local file system?  If
it is the former case, how can we make sure the later call "FSDataInputStream
fileIn = fs.open(split.getPath()); in = new LineReader(fileIn, job);" gives
the block residing on the same local node instead of a replica on another
node?  If it is the latter case ("split.getPath()" giving the local file),
how can we get the input stream handle to read the next split for an extra
line when reaching the end of the split?  Thanks.

Zhu, Guojun
Modeling Sr Graduate
571-3824370
guojun_...@freddiemac.com
Financial Engineering
Freddie Mac 

From: Harsh J <ha...@cloudera.com>
Sent: 02/10/2012 12:02 PM
Reply-To: mapreduce-user@hadoop.apache.org
To: mapreduce-user@hadoop.apache.org
Subject: Re: Does FileSplit respect the record boundary?

Hi,

Please read the map section of
http://wiki.apache.org/hadoop/HadoopMapReduce to understand how Hadoop
ends up respecting record boundaries despite block-chops not taking
that into consideration. I hope it helps clear things up for you.
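
As a rough illustration of the idea, the sketch below simulates it in plain Java on an in-memory byte array. It is a simplified model and not the actual LineRecordReader code: the class and method names are made up for this example, and the convention shown (a split that does not start at offset 0 backs up one byte and discards through the next newline, and every split finishes the last line it started even if that line runs past the split's end) is an assumption modeled on the line-oriented reader.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitBoundarySketch {

    // Returns the lines "owned" by the byte range [start, end) of data.
    // Hypothetical helper for illustration only.
    public static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard through the next newline: a line
            // beginning exactly at 'start' is kept, a partial line is dropped
            // because the previous split already read it in full.
            pos = nextLineStart(data, start - 1);
        }
        List<String> lines = new ArrayList<>();
        // A line is read whenever it begins before 'end', even if it ends later.
        while (pos < end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    // Offset just past the first '\n' at or after 'from', or data.length if none.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        int cut = 8; // falls in the middle of "bravo"
        System.out.println(readSplit(data, 0, cut));           // [alpha, bravo]
        System.out.println(readSplit(data, cut, data.length)); // [charlie]
    }
}
```

The two splits together yield every line exactly once, although the cut at byte 8 lands inside "bravo".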

On Fri, Feb 10, 2012 at 10:26 PM, GUOJUN Zhu <guojun_...@freddiemac.com>
wrote:
>
> Hi,
>
> I am learning Hadoop.  We have some specially formatted text files for
> input, so we need to write a customized InputFormat, probably based on
> FileInputFormat.  Does the FileInputFormat respect the record boundary
> (every line or maybe every other line)?  I am reading the source code
> (1.0.0).  For example, in the LineRecordReader, is the "in" field
> (InputStream) of the LineReader(in,..) the full HDFS file (of many blocks)
> or just the real local file of one block?  All the books I have read give
> very little detail about it.  Can any expert point me to some reference
> about it, or maybe to the part of the source code I should concentrate on?
> Thanks.
>
> Zhu, Guojun
> Modeling Sr Graduate
> 571-3824370
> guojun_...@freddiemac.com
> Financial Engineering
> Freddie Mac

-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
