Re: Using own InputSplit

2011-05-29 Thread Michel Segel
You can add that sometimes the input file is too small to yield more
than one split, so you don't get the desired parallelism (see the
sketch below for one way around that).

Sent from a remote device. Please excuse any typos...

Mike Segel
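
One workaround, as a minimal sketch: use NLineInputFormat to split the
input every N lines instead of per HDFS block, so even a tiny file fans
out over several map tasks. This assumes the newer
(org.apache.hadoop.mapreduce) API; the class name and the 1000-line
figure below are only illustrative, and the Job factory method varies
across Hadoop versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class SmallFileJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "small-file-demo");
    // Split the input every 1000 lines rather than every block, so a
    // file well under one block still produces multiple map tasks.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000);
    // ... set mapper, reducer, and input/output paths as usual ...
  }
}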

On May 27, 2011, at 12:25 PM, Harsh J ha...@cloudera.com wrote:

 Mohit,
 
 On Fri, May 27, 2011 at 10:44 PM, Mohit Anchlia mohitanch...@gmail.com 
 wrote:
 Actually, this link confused me:
 
 http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input
 
 Clearly, logical splits based on input-size is insufficient for many
 applications since record boundaries must be respected. In such cases,
 the application should implement a RecordReader, who is responsible
 for respecting record-boundaries and presents a record-oriented view
 of the logical InputSplit to the individual task.
 
 But it looks like the application doesn't need to do that, since it's
 done by default? Or am I misinterpreting this entirely?
 
 Any InputFormat that Hadoop ships with should already handle this for
 you (text files (say, \n-delimited), SequenceFiles, Avro data files).
 If you have a custom file format that defines its own record delimiter
 character(s), you will need to write your own InputFormat that handles
 split boundaries properly (the wiki still helps on how to manage reads
 across the first split and the subsequent ones).
 
 -- 
 Harsh J
 


Using own InputSplit

2011-05-27 Thread Mohit Anchlia
I am new to Hadoop, and from what I understand, by default Hadoop splits
the input into blocks. This might result in a line of a record being
split into two pieces and spread across two maps. E.g., the line abcd
might get split into ab and cd. How can one prevent this in Hadoop and
Pig? I am looking for examples of how I can specify my own split so that
it splits logically on the record delimiter rather than the block size.
For some reason I am not able to find the right examples online.


Re: Using own InputSplit

2011-05-27 Thread Harsh J
Mohit,

Please do not cross-post a question to multiple lists unless you're
announcing something.

What you describe does not happen; the way splitting is done for text
files is explained in good detail here:
http://wiki.apache.org/hadoop/HadoopMapReduce

Hope this solves your doubt :)
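
To make the rule on that wiki page concrete, here is a tiny plain-Java
sketch of the behaviour (linesForSplit is a made-up name; this mimics
what the line reader does, it is not Hadoop's actual code): a split
skips its first, possibly partial, line unless it starts at offset 0,
and reads past its own end to finish its last line, so every line is
handed to exactly one mapper.

import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {
  // Emit the lines a reader for the byte range [start, end) would own.
  static List<String> linesForSplit(String data, int start, int end) {
    List<String> out = new ArrayList<String>();
    int pos = start;
    if (start != 0) {
      // Skip the partial first line; the previous split emits it.
      pos = data.indexOf('\n', start) + 1;
      if (pos == 0) return out; // no newline after start: nothing owned
    }
    while (pos < end && pos < data.length()) {
      int nl = data.indexOf('\n', pos);
      if (nl < 0) nl = data.length();
      out.add(data.substring(pos, nl)); // may read past end to finish a line
      pos = nl + 1;
    }
    return out;
  }

  public static void main(String[] args) {
    String data = "abcd\nefgh\nijkl\n";
    // A block boundary at offset 2 cuts "abcd" into "ab" / "cd", but the
    // whole line still goes to the first split only.
    System.out.println(linesForSplit(data, 0, 2));  // [abcd]
    System.out.println(linesForSplit(data, 2, 15)); // [efgh, ijkl]
  }
}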

On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
 I am new to Hadoop, and from what I understand, by default Hadoop splits
 the input into blocks. This might result in a line of a record being
 split into two pieces and spread across two maps. E.g., the line abcd
 might get split into ab and cd. How can one prevent this in Hadoop and
 Pig? I am looking for examples of how I can specify my own split so that
 it splits logically on the record delimiter rather than the block size.
 For some reason I am not able to find the right examples online.




-- 
Harsh J


Re: Using own InputSplit

2011-05-27 Thread Mohit Anchlia
Thanks! I just thought it was better to post to multiple groups at once
since I didn't know where it belonged :)

On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote:
 Mohit,

 Please do not cross-post a question to multiple lists unless you're
 announcing something.

 What you describe does not happen; the way splitting is done for text
 files is explained in good detail here:
 http://wiki.apache.org/hadoop/HadoopMapReduce

 Hope this solves your doubt :)

 On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com 
 wrote:
 I am new to Hadoop, and from what I understand, by default Hadoop splits
 the input into blocks. This might result in a line of a record being
 split into two pieces and spread across two maps. E.g., the line abcd
 might get split into ab and cd. How can one prevent this in Hadoop and
 Pig? I am looking for examples of how I can specify my own split so that
 it splits logically on the record delimiter rather than the block size.
 For some reason I am not able to find the right examples online.




 --
 Harsh J



Re: Using own InputSplit

2011-05-27 Thread Harsh J
Just to clarify: the query fit mapreduce-user, since it primarily dealt
with how Map/Reduce operates over data :)

On Fri, May 27, 2011 at 10:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
 Thanks! I just thought it was better to post to multiple groups at once
 since I didn't know where it belonged :)

 On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote:
 Mohit,

 Please do not cross-post a question to multiple lists unless you're
 announcing something.

 What you describe does not happen; the way splitting is done for text
 files is explained in good detail here:
 http://wiki.apache.org/hadoop/HadoopMapReduce

 Hope this solves your doubt :)

 On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com 
 wrote:
 I am new to Hadoop, and from what I understand, by default Hadoop splits
 the input into blocks. This might result in a line of a record being
 split into two pieces and spread across two maps. E.g., the line abcd
 might get split into ab and cd. How can one prevent this in Hadoop and
 Pig? I am looking for examples of how I can specify my own split so that
 it splits logically on the record delimiter rather than the block size.
 For some reason I am not able to find the right examples online.




 --
 Harsh J





-- 
Harsh J


Re: Using own InputSplit

2011-05-27 Thread Mohit Anchlia
Actually, this link confused me:

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input

Clearly, logical splits based on input-size is insufficient for many
applications since record boundaries must be respected. In such cases,
the application should implement a RecordReader, who is responsible
for respecting record-boundaries and presents a record-oriented view
of the logical InputSplit to the individual task.

But it looks like the application doesn't need to do that, since it's
done by default? Or am I misinterpreting this entirely?

On Fri, May 27, 2011 at 10:08 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 Thanks! I just thought it was better to post to multiple groups at once
 since I didn't know where it belonged :)

 On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote:
 Mohit,

 Please do not cross-post a question to multiple lists unless you're
 announcing something.

 What you describe does not happen; the way splitting is done for text
 files is explained in good detail here:
 http://wiki.apache.org/hadoop/HadoopMapReduce

 Hope this solves your doubt :)

 On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com 
 wrote:
 I am new to Hadoop, and from what I understand, by default Hadoop splits
 the input into blocks. This might result in a line of a record being
 split into two pieces and spread across two maps. E.g., the line abcd
 might get split into ab and cd. How can one prevent this in Hadoop and
 Pig? I am looking for examples of how I can specify my own split so that
 it splits logically on the record delimiter rather than the block size.
 For some reason I am not able to find the right examples online.




 --
 Harsh J




Re: Using own InputSplit

2011-05-27 Thread Harsh J
Mohit,

On Fri, May 27, 2011 at 10:44 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
 Actually, this link confused me:

 http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input

 Clearly, logical splits based on input-size is insufficient for many
 applications since record boundaries must be respected. In such cases,
 the application should implement a RecordReader, who is responsible
 for respecting record-boundaries and presents a record-oriented view
 of the logical InputSplit to the individual task.

 But it looks like the application doesn't need to do that, since it's
 done by default? Or am I misinterpreting this entirely?

Any InputFormat that Hadoop ships with should already handle this for
you (text files (say, \n-delimited), SequenceFiles, Avro data files).
If you have a custom file format that defines its own record delimiter
character(s), you will need to write your own InputFormat that handles
split boundaries properly (the wiki still helps on how to manage reads
across the first split and the subsequent ones).
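
For the simple custom-delimiter case there is also a lighter option, as
a sketch: newer Hadoop releases let the stock line reader take its
record delimiter from configuration (treat the availability of the
textinputformat.record.delimiter property as an assumption for your
version; older releases need the custom InputFormat route above).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CustomDelimiterJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Records end with "%%" instead of '\n'; the reader applies the same
    // skip-first-record/read-past-split-end rule at block boundaries.
    conf.set("textinputformat.record.delimiter", "%%");
    Job job = Job.getInstance(conf, "custom-delimiter-demo");
    job.setInputFormatClass(TextInputFormat.class);
    // ... mapper, reducer, and input/output paths as usual ...
  }
}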

-- 
Harsh J