Mohit,

On Fri, May 27, 2011 at 10:44 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> Actually this link confused me:
>
> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input
>
> "Clearly, logical splits based on input-size is insufficient for many
> applications since record boundaries must be respected. In such cases,
> the application should implement a RecordReader, who is responsible
> for respecting record-boundaries and presents a record-oriented view
> of the logical InputSplit to the individual task."
>
> But it looks like the application doesn't need to do that since it's done
> by default? Or am I misinterpreting this entirely?
For any InputFormat that Hadoop ships with (text files, say \n-terminated; SequenceFiles; Avro data files), this is already handled for you. Only if you have a custom file format that defines its own record delimiter character(s) would you need to write your own InputFormat that splits properly across record boundaries (the wiki still helps on how to manage reads across the first split and the subsequent ones).

-- 
Harsh J
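To make the boundary handling concrete, here is a minimal, Hadoop-free sketch of the convention Hadoop's line-based readers follow; the `SplitReader`/`readSplit` names are made up for illustration, and this is not the real LineRecordReader code, just the same rule applied to an in-memory byte array:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model of the boundary rule (illustration only, not Hadoop source):
// a record belongs to the split that contains its first byte, so each
// reader skips the partial record at the start of its split (the
// previous split's reader finishes it) and reads past its own end to
// complete a record that straddles the boundary.
public class SplitReader {

    // Returns the newline-delimited records of `data` that belong to
    // the byte-range split [start, start + length).
    public static List<String> readSplit(byte[] data, int start, int length) {
        List<String> records = new ArrayList<>();
        int pos = start;
        int end = start + length;

        // Unless the split begins at offset 0, skip forward to the first
        // record boundary; the straddling record was already consumed by
        // the previous split's reader.
        if (start != 0) {
            while (pos < data.length && data[pos - 1] != '\n') {
                pos++;
            }
        }

        // Start a record only if its first byte lies inside the split,
        // but keep reading past `end` until the record is complete.
        while (pos < data.length && pos < end) {
            int recStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            records.add(new String(data, recStart, pos - recStart,
                    StandardCharsets.UTF_8));
            pos++; // step over the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n"
                .getBytes(StandardCharsets.UTF_8);
        // Chop the input into arbitrary 7-byte splits; every record still
        // comes out exactly once, from the split that owns its first byte.
        for (int s = 0; s < data.length; s += 7) {
            int len = Math.min(7, data.length - s);
            System.out.println("split [" + s + ", " + (s + len) + "): "
                    + readSplit(data, s, len));
        }
    }
}
```

The point of the sketch: split boundaries can fall mid-record, but because every reader applies the same skip-then-overread rule, no record is lost or duplicated, which is exactly what the built-in InputFormats take care of for you.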