Re: Using own InputSplit
You can add that sometimes the input file is too small and you don't get the desired parallelism.

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 27, 2011, at 12:25 PM, Harsh J ha...@cloudera.com wrote:

> Mohit,
>
> On Fri, May 27, 2011 at 10:44 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
>> Actually this link confused me:
>> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input
>>
>> "Clearly, logical splits based on input-size is insufficient for many
>> applications since record boundaries must be respected. In such cases, the
>> application should implement a RecordReader, which is responsible for
>> respecting record boundaries and presents a record-oriented view of the
>> logical InputSplit to the individual task."
>>
>> But it looks like the application doesn't need to do that since it's done
>> by default? Or am I misinterpreting this entirely?
>
> For any type of InputFormat Hadoop provides along with itself, it should
> already handle this for you (text files (say, \n-ended), Sequence Files,
> Avro data files). If you have a custom file format that defines its own
> record delimiter character(s), you would surely need to write your own
> InputFormat that splits properly (the wiki still helps on how to manage
> the reads across the first split and the subsequent ones).
>
> --
> Harsh J
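Mike's point about small files follows from how split planning works: a FileInputFormat-style planner produces roughly one split per split-size-sized chunk, so a file smaller than one chunk yields a single split, and hence a single map task. A minimal sketch in plain Java (no Hadoop dependency; the method name is illustrative, not Hadoop's actual API):

```java
// Plain-Java sketch (no Hadoop dependency) of how a FileInputFormat-style
// planner derives split counts; numSplits is an illustrative name, not
// Hadoop's actual API.
public class SplitCount {
    // One split per splitSize-sized chunk; the remainder forms a smaller
    // final split. A non-empty file smaller than splitSize yields exactly
    // one split, and hence one map task.
    static long numSplits(long fileSize, long splitSize) {
        if (fileSize == 0) return 0;
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // 64 MB, a common block size in 2011
        System.out.println(numSplits(10_000, block));             // prints 1
        System.out.println(numSplits(200L * 1024 * 1024, block)); // prints 4
    }
}
```

So a 10 KB input file gives one map no matter how many slots the cluster has; raising parallelism for such inputs requires a smaller configured split size or a different input format.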
Using own InputSplit
I am new to Hadoop, and from what I understand, by default Hadoop splits the input into blocks. Now this might result in a line of a record being split into two pieces and spread across two maps. E.g., the line "abcd" might get split into "ab" and "cd". How can one prevent this in Hadoop and Pig?

I am looking for some examples where I can see how to specify my own split so that it splits logically on the record delimiter and not on the block size. For some reason I am not able to find the right examples online.
Re: Using own InputSplit
Mohit,

Please do not cross-post a question to multiple lists unless you're announcing something.

What you describe does not happen; the way the splitting is done for text files is explained in good detail here:
http://wiki.apache.org/hadoop/HadoopMapReduce

Hope this solves your doubt :)

On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
> I am new to Hadoop, and from what I understand, by default Hadoop splits
> the input into blocks. Now this might result in a line of a record being
> split into two pieces and spread across two maps. E.g., the line "abcd"
> might get split into "ab" and "cd". How can one prevent this in Hadoop and
> Pig? I am looking for some examples where I can see how to specify my own
> split so that it splits logically on the record delimiter and not on the
> block size. For some reason I am not able to find the right examples online.

--
Harsh J
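The rule the wiki describes can be sketched in plain Java (no Hadoop dependency; this mimics the behavior of Hadoop's line reader rather than calling its API): each reader skips a partial first line unless its split starts at byte 0, and reads past its split's end to finish its last line, so every line is read by exactly one map even when it straddles a block boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the boundary rule described on the Hadoop wiki.
// This mimics the behavior of Hadoop's line reader; it does not call the
// Hadoop API.
public class LineSplitDemo {
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        // If we start mid-line, the previous split's reader owns that
        // line, so skip forward past its terminating newline.
        if (start != 0 && data[start - 1] != '\n') {
            while (pos < data.length && data[pos++] != '\n') { }
        }
        // Emit every line whose first byte lies inside [start, end),
        // reading past 'end' if the final line crosses the boundary.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // step over the '\n'
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "ab\ncdef\ngh\n".getBytes();
        // A split boundary at byte 5 falls mid-way through "cdef", yet
        // each line is produced exactly once across the two readers.
        System.out.println(readSplit(data, 0, 5));           // [ab, cdef]
        System.out.println(readSplit(data, 5, data.length)); // [gh]
    }
}
```

The first reader finishes "cdef" by reading past its split's end; the second reader skips the tail of "cdef" because it knows the first reader owns it. No line is ever torn in half.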
Re: Using own InputSplit
Thanks! Just thought it was better to post to multiple groups together since I didn't know where it belonged :)

On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote:
> Mohit,
>
> Please do not cross-post a question to multiple lists unless you're
> announcing something.
>
> What you describe does not happen; the way the splitting is done for text
> files is explained in good detail here:
> http://wiki.apache.org/hadoop/HadoopMapReduce
>
> Hope this solves your doubt :)
>
> On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
>> I am new to Hadoop, and from what I understand, by default Hadoop splits
>> the input into blocks. Now this might result in a line of a record being
>> split into two pieces and spread across two maps. E.g., the line "abcd"
>> might get split into "ab" and "cd". How can one prevent this in Hadoop
>> and Pig? I am looking for some examples where I can see how to specify
>> my own split so that it splits logically on the record delimiter and not
>> on the block size. For some reason I am not able to find the right
>> examples online.
>
> --
> Harsh J
Re: Using own InputSplit
Just to clarify: the query fit in mapreduce-user, since it primarily dealt with how Map/Reduce operates over data :)

On Fri, May 27, 2011 at 10:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
> Thanks! Just thought it was better to post to multiple groups together
> since I didn't know where it belonged :)
>
> On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote:
>> Mohit,
>>
>> Please do not cross-post a question to multiple lists unless you're
>> announcing something.
>>
>> What you describe does not happen; the way the splitting is done for
>> text files is explained in good detail here:
>> http://wiki.apache.org/hadoop/HadoopMapReduce
>>
>> Hope this solves your doubt :)
>>
>> On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
>>> I am new to Hadoop, and from what I understand, by default Hadoop
>>> splits the input into blocks. Now this might result in a line of a
>>> record being split into two pieces and spread across two maps. E.g.,
>>> the line "abcd" might get split into "ab" and "cd". How can one
>>> prevent this in Hadoop and Pig? I am looking for some examples where I
>>> can see how to specify my own split so that it splits logically on the
>>> record delimiter and not on the block size. For some reason I am not
>>> able to find the right examples online.
>>
>> --
>> Harsh J

--
Harsh J
Re: Using own InputSplit
Actually this link confused me:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input

"Clearly, logical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presents a record-oriented view of the logical InputSplit to the individual task."

But it looks like the application doesn't need to do that since it's done by default? Or am I misinterpreting this entirely?

On Fri, May 27, 2011 at 10:08 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
> Thanks! Just thought it was better to post to multiple groups together
> since I didn't know where it belonged :)
>
> On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote:
>> Mohit,
>>
>> Please do not cross-post a question to multiple lists unless you're
>> announcing something.
>>
>> What you describe does not happen; the way the splitting is done for
>> text files is explained in good detail here:
>> http://wiki.apache.org/hadoop/HadoopMapReduce
>>
>> Hope this solves your doubt :)
>>
>> On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
>>> I am new to Hadoop, and from what I understand, by default Hadoop
>>> splits the input into blocks. Now this might result in a line of a
>>> record being split into two pieces and spread across two maps. E.g.,
>>> the line "abcd" might get split into "ab" and "cd". How can one
>>> prevent this in Hadoop and Pig? I am looking for some examples where I
>>> can see how to specify my own split so that it splits logically on the
>>> record delimiter and not on the block size. For some reason I am not
>>> able to find the right examples online.
>>
>> --
>> Harsh J
Re: Using own InputSplit
Mohit,

On Fri, May 27, 2011 at 10:44 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
> Actually this link confused me:
> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input
>
> "Clearly, logical splits based on input-size is insufficient for many
> applications since record boundaries must be respected. In such cases, the
> application should implement a RecordReader, which is responsible for
> respecting record boundaries and presents a record-oriented view of the
> logical InputSplit to the individual task."
>
> But it looks like the application doesn't need to do that since it's done
> by default? Or am I misinterpreting this entirely?

For any type of InputFormat Hadoop provides along with itself, it should already handle this for you (text files (say, \n-ended), Sequence Files, Avro data files). If you have a custom file format that defines its own record delimiter character(s), you would surely need to write your own InputFormat that splits properly (the wiki still helps on how to manage the reads across the first split and the subsequent ones).

--
Harsh J
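The boundary-handling rule generalizes directly to a custom record delimiter; in a real job this logic would sit inside the RecordReader that your custom InputFormat returns. A hedged sketch in plain Java (no Hadoop types; the names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the split-boundary rule for a custom record delimiter
// (here ';'), in plain Java with no Hadoop dependency. In a real job this
// logic would live in the RecordReader returned by your custom
// InputFormat: skip the partial first record (unless the split starts at
// offset 0) and read past the split's end to finish the last record.
public class CustomDelimDemo {
    static List<String> records(byte[] data, int start, int end, byte delim) {
        List<String> out = new ArrayList<>();
        int pos = start;
        if (start != 0 && data[start - 1] != delim) {
            // Mid-record: the previous split's reader owns this record.
            while (pos < data.length && data[pos++] != delim) { }
        }
        // Emit every record whose first byte lies inside [start, end),
        // reading past 'end' if the last record crosses the boundary.
        while (pos < end && pos < data.length) {
            int recStart = pos;
            while (pos < data.length && data[pos] != delim) pos++;
            out.add(new String(data, recStart, pos - recStart));
            pos++; // step over the delimiter
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "one;two;three;".getBytes();
        // A split boundary at byte 5 falls inside "two"; each record is
        // still produced exactly once across the two readers.
        System.out.println(records(data, 0, 5, (byte) ';'));           // [one, two]
        System.out.println(records(data, 5, data.length, (byte) ';')); // [three]
    }
}
```

This is the same skip-then-overread contract that Hadoop's own line reader applies for '\n'; a custom InputFormat simply swaps in its own delimiter (and, if needed, a multi-byte delimiter scan).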