Hi Zhu,

- split.getPath() in LineRecordReader returns the path of the file in HDFS itself, not a path on the local file system. It is the NameNode that returns the list of DataNodes for a particular block, sorted by distance from the client (here, the machine where the task is running), so a local replica, if one exists, is preferred.

- For a line that ends in the next block, the HDFS client itself takes care of getting the next block's information from the NameNode and passing the data on to the LineReader. The LineReader just continues reading without worrying about the location of the block.

  - One suggestion for better performance is to set the split size for the job to the block size of the input file. If the split size is larger than the block size, the task may need to fetch block data from multiple DataNodes.

Thanks and Regards,
Vinayakumar B

From: GUOJUN Zhu [mailto:guojun_...@freddiemac.com]
Sent: Saturday, February 11, 2012 3:50 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Does FileSplit respect the record boundary?

Thank you for the reply. That page helps a lot. I still have a more specific question.

In the LineRecordReader constructor (Hadoop 1.0.0), public LineRecordReader(Configuration job, FileSplit split), does the call "final Path file = split.getPath()" return the logical file in HDFS, or just the real local file corresponding to the block in the local file system?

If it is the former, how can we make sure that the later call "FSDataInputStream fileIn = fs.open(split.getPath()); in = new LineReader(fileIn, job);" gives the block residing on the same local node instead of a replica on another node?

If it is the latter ("split.getPath()" giving the local file), how can we get the input stream handle to read the next split for an extra line when reaching the end of the split?

Thanks.

Zhu, Guojun
Modeling Sr Graduate
571-3824370
guojun_...@freddiemac.com
Financial Engineering
Freddie Mac

From: Harsh J <ha...@cloudera.com>
Date: 02/10/2012 12:02 PM
Please respond to: mapreduce-user@hadoop.apache.org
To: mapreduce-user@hadoop.apache.org
Subject: Re: Does FileSplit respect the record boundary?

Hi,

Please read the map section of http://wiki.apache.org/hadoop/HadoopMapReduce to understand how Hadoop ends up respecting record boundaries despite block-chops not taking that into consideration.

I hope it helps clear things up for you.

On Fri, Feb 10, 2012 at 10:26 PM, GUOJUN Zhu <guojun_...@freddiemac.com> wrote:
>
> Hi,
>
> I am learning Hadoop. We have some specially formatted text files for input, so
> we need to write a customized InputFormat, probably based on
> FileInputFormat. Does FileInputFormat respect the record boundary
> (every line or maybe every other line)? I am reading the source code
> (1.0.0). For example, in the LineRecordReader, is the "in" field (InputStream)
> of the LineReader(in,..) the full HDFS file (of many blocks) or just the
> real local file of one block? All the books I have read have very little detail
> about it. Can any expert point me to some reference about it, or maybe
> which part of the source code I should concentrate on? Thanks.
>
> Zhu, Guojun
> Modeling Sr Graduate
> 571-3824370
> guojun_...@freddiemac.com
> Financial Engineering
> Freddie Mac

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
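
To illustrate the record-boundary handling discussed in this thread, here is a minimal, self-contained sketch in plain Java. It is a simplified simulation, not the Hadoop source: the class SplitBoundaryDemo and the method readSplit are hypothetical names, and a byte array stands in for the whole HDFS file that fs.open(split.getPath()) would expose. The idea, as I understand the approach, is that a reader whose split does not start at byte 0 discards the line that straddles its start offset, while every reader finishes the last line it began even if that line runs past the end of its split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {

    /**
     * Returns the lines that a reader assigned the byte range [start, end)
     * would emit. The reader sees the whole file, not just its own range.
     */
    static List<String> readSplit(byte[] file, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard everything up to and including the
            // next newline. That line was begun in the previous split, so the
            // previous reader is responsible for it.
            pos = start - 1;
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }

        List<String> lines = new ArrayList<>();
        // Start a new line only while still inside the split, but always read
        // that line to its end, even when it crosses the split boundary.
        while (pos < end && pos < file.length) {
            int lineStart = pos;
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            lines.add(new String(file, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] file = "aaa\nbbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 5 falls in the middle of "bbb".
        System.out.println(readSplit(file, 0, 5));
        System.out.println(readSplit(file, 5, file.length));
    }
}
```

With the boundary at byte 5, the first reader emits [aaa, bbb] and the second emits [ccc], so each line is produced exactly once even though "bbb" spans both ranges. In a real job the second half of "bbb" may live in a different block, which is where the HDFS client transparently fetches the next block as described above.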