Thank you for the reply.  That page helps a lot.  I still have a more
specific question.  In the LineRecordReader constructor (Hadoop 1.0.0),
public LineRecordReader(Configuration job, FileSplit split), does the call
"final Path file = split.getPath()" return the logical file in HDFS, or
just the real local file corresponding to the block in the local file
system?  If it is the former, how can we make sure that the later call
"FSDataInputStream fileIn = fs.open(split.getPath()); in = new
LineReader(fileIn, job);" reads the block residing on the same local node
instead of a replica on another node?  If it is the latter
("split.getPath()" giving the local file), how can we get an input stream
handle to read into the next split for the extra line when we reach the
end of the split?  Thanks.

Zhu, Guojun
Modeling Sr Graduate
571-3824370
guojun_...@freddiemac.com
Financial Engineering
Freddie Mac


Harsh J <ha...@cloudera.com>
02/10/2012 12:02 PM
Please respond to: mapreduce-user@hadoop.apache.org
To: mapreduce-user@hadoop.apache.org
cc:
Subject: Re: Does FileSplit respect the record boundary?

Hi,

Please read the map section of
http://wiki.apache.org/hadoop/HadoopMapReduce to understand how Hadoop
ends up respecting record boundaries despite block-chops not taking
that into consideration. I hope it helps clear things up for you.
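
As a rough illustration of the idea on that page, here is a small
standalone sketch. It is not the Hadoop source: the class and method
names (SplitBoundaryDemo, readSplit) are made up for this example, an
in-memory byte array stands in for the file, and lines are assumed to end
with '\n' only. The point it tries to show is that each reader may read
outside its own byte range: a reader whose split does not start at offset
0 discards the partial line it lands in, and every reader finishes the
last line it started even if that line runs past the end of the split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Hadoop.
class SplitBoundaryDemo {

    // Returns the lines owned by the byte range [start, start + length).
    static List<String> readSplit(byte[] data, long start, long length) {
        long end = start + length;
        int pos = (int) start;
        if (start != 0) {
            // Back up one byte and skip through the next newline, so that
            // a line beginning exactly at 'start' is kept and a partial
            // line is left to the reader of the previous split.
            pos = (int) start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // Start a new line only while still inside the split, but always
        // read that line to its end, even past 'end'.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart,
                    StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n"
                .getBytes(StandardCharsets.UTF_8);
        long splitSize = 10; // arbitrary size that cuts lines in the middle
        for (long start = 0; start < data.length; start += splitSize) {
            long length = Math.min(splitSize, data.length - start);
            System.out.println(start + " -> " + readSplit(data, start, length));
        }
    }
}
```

With the 10-byte splits used in main, "bravo" straddles the first
boundary and is returned by the first split only, so every line is read
exactly once across all splits.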

On Fri, Feb 10, 2012 at 10:26 PM, GUOJUN Zhu <guojun_...@freddiemac.com> wrote:
>
> Hi,
>
> I am learning Hadoop.  We have some specially formatted text files for
> input, so we need to write a customized InputFormat, probably based on
> FileInputFormat.  Does FileInputFormat respect the record boundary
> (every line, or maybe every other line)?  I am reading the source code
> (1.0.0).  For example, in LineRecordReader, is the "in" field
> (InputStream) of the LineReader(in, ..) the full HDFS file (of many
> blocks) or just the real local file of one block?  All the books I have
> read give very little detail about it.  Can any expert point me to some
> reference about it, or maybe to the part of the source code I should
> concentrate on?  Thanks.
>
> Zhu, Guojun
> Modeling Sr Graduate
> 571-3824370
> guojun_...@freddiemac.com
> Financial Engineering
> Freddie Mac



-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
