Actually, this link confused me: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input
"Clearly, logical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, who is responsible for respecting record-boundaries and presents a record-oriented view of the logical InputSplit to the individual task." But it looks like application doesn't need to do that since it's done default? Or am I misinterpreting this entirely? On Fri, May 27, 2011 at 10:08 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote: > thanks! Just thought it's better to post to multiple groups together > since I didn't know where it belongs :) > > On Fri, May 27, 2011 at 10:04 AM, Harsh J <ha...@cloudera.com> wrote: >> Mohit, >> >> Please do not cross-post a question to multiple lists unless you're >> announcing something. >> >> What you describe, does not happen; and the way the splitting is done >> for Text files is explained in good detail here: >> http://wiki.apache.org/hadoop/HadoopMapReduce >> >> Hope this solves your doubt :) >> >> On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia <mohitanch...@gmail.com> >> wrote: >>> I am new to hadoop and from what I understand by default hadoop splits >>> the input into blocks. Now this might result in splitting a line of >>> record into 2 pieces and getting spread accross 2 maps. For eg: Line >>> "abcd" might get split into "ab" and "cd". How can one prevent this in >>> hadoop and pig? I am looking for some examples where I can see how I >>> can specify my own split so that it logically splits based on the >>> record delimiter and not the block size. For some reason I am not able >>> to get right examples online. >>> >> >> >> >> -- >> Harsh J >> >