Actually, this link confused me: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input
"Clearly, logical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, who is responsible for respecting record-boundaries and presents a record-oriented view of the logical InputSplit to the individual task." But it looks like application doesn't need to do that since it's done default? Or am I misinterpreting this entirely? On Fri, May 27, 2011 at 10:08 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote: > thanks! Just thought it's better to post to multiple groups together > since I didn't know where it belongs :) > > On Fri, May 27, 2011 at 10:04 AM, Harsh J <ha...@cloudera.com> wrote: >> Mohit, >> >> Please do not cross-post a question to multiple lists unless you're >> announcing something. >> >> What you describe, does not happen; and the way the splitting is done >> for Text files is explained in good detail here: >> http://wiki.apache.org/hadoop/HadoopMapReduce >> >> Hope this solves your doubt :) >> >> On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia <mohitanch...@gmail.com> >> wrote: >>> I am new to hadoop and from what I understand by default hadoop splits >>> the input into blocks. Now this might result in splitting a line of >>> record into 2 pieces and getting spread accross 2 maps. For eg: Line >>> "abcd" might get split into "ab" and "cd". How can one prevent this in >>> hadoop and pig? I am looking for some examples where I can see how I >>> can specify my own split so that it logically splits based on the >>> record delimiter and not the block size. For some reason I am not able >>> to get right examples online. >>> >> >> >> >> -- >> Harsh J >> >