Re: A couple of Questions on InputFormat

2013-09-23 Thread Harsh J
Hi,

Yes, that is right.

On Mon, Sep 23, 2013 at 9:04 PM, Steve Lewis  wrote:
> Thank you for your thorough answer.
> The last question is essentially this: while I can write a custom input
> format to handle things like hyphens, I could do almost the same thing in
> the mapper by saving any hyphenated word from the end of the previous line
> (ignoring hyphenated words that cross a split boundary), as long as
> LineRecordReader guarantees that each line in the split is sent to the
> same mapper in the order read.
> This seems to be the case - right?



-- 
Harsh J


Re: A couple of Questions on InputFormat

2013-09-23 Thread Steve Lewis
Thank you for your thorough answer.
The last question is essentially this: while I can write a custom input
format to handle things like hyphens, I could do almost the same thing in
the mapper by saving any hyphenated word from the end of the previous line
(ignoring hyphenated words that cross a split boundary), as long as
LineRecordReader guarantees that each line in the split is sent to the
same mapper in the order read.
This seems to be the case - right?
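
A minimal sketch of this mapper-side idea, in plain Java with no Hadoop
dependency. The class HyphenJoiningCounter and its method names are
hypothetical; in a real job the same logic would presumably live in
Mapper.map() and Mapper.cleanup(). It assumes the lines of one split arrive
in file order, and it does not try to join words across a split boundary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a WordCount mapper that joins words hyphenated
// across line ends. Assumes map() is called once per line, in file order.
class HyphenJoiningCounter {
    private final Map<String, Integer> counts = new HashMap<>();
    // Fragment carried over from the end of the previous line ("" if none).
    private String pending = "";

    void map(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return; // blank line: keep any pending fragment for the next line
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String word = tokens[i];
            if (i == 0 && !pending.isEmpty()) {
                // Glue the saved fragment onto the first word of this line.
                word = pending + word;
                pending = "";
            }
            boolean lastOnLine = (i == tokens.length - 1);
            if (lastOnLine && word.length() > 1 && word.endsWith("-")) {
                // Hold the fragment back instead of counting it now.
                pending = word.substring(0, word.length() - 1);
            } else {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Call once after the last line of the split: a fragment left over here
    // would continue in another split, so it is counted as a word by itself.
    void cleanup() {
        if (!pending.isEmpty()) {
            counts.merge(pending, 1, Integer::sum);
            pending = "";
        }
    }

    Map<String, Integer> counts() {
        return counts;
    }
}
```

Feeding it the two lines "a word that was hyphen-" and "ated here" counts
"hyphenated" once and never counts "hyphen-" or "ated" separately.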





-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: A couple of Questions on InputFormat

2013-09-23 Thread Harsh J
Hi,

(I'm assuming 1.0~ MR here)

On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis  wrote:
> Classes implementing InputFormat implement
>  public List<InputSplit> getSplits(JobContext job), which returns a List of
> InputSplits. For FileInputFormat, the splits have a Path, a start, and an end.
>
> 1) When is this method called, in which JVM and on which machine, and is it
> called only once?

Called only on the client, i.e. in your "hadoop jar" JVM. Called only once.

> 2) Does the number of map tasks correspond to the number of splits returned by
> getSplits?

Yes, number of split objects == number of mappers.
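
To make the "one split, one mapper" relationship concrete, here is a rough
sketch of cutting a single file into byte ranges. It is plain Java, and both
SplitPlanSketch and Range are hypothetical names; the real FileInputFormat
also takes block size and the configured minimum/maximum split sizes into
account, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of how a file is cut into splits.
class SplitPlanSketch {
    /** A byte range of one file: [start, start + length). */
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Cuts a file of fileLength bytes into ranges of at most splitSize bytes. */
    static List<Range> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last range may be shorter than splitSize.
            splits.add(new Range(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }
}
```

Under these assumptions a 250-byte file with a 100-byte split size yields
three ranges (0-99, 100-199, 200-249), and so three map tasks.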

> 3) InputFormat implements a method
>  RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext
> context). Is this executed within the JVM of the mapper on the slave
> machine, and does the RecordReader run within that JVM?

RecordReaders are not created on the client side JVM. RecordReaders
are created on the Map task JVMs, and run inside them.

> 4) The default RecordReaders read a file from the start position to the end
> position, emitting values in the order read. With such a reader, assuming it is
> reading lines of text, is it reasonable to assume that the values the mapper
> receives are in the same order they were found in the file? Would it, for
> example, be possible for WordCount to see a word that was hyphen-
> ated at the end of one line and append the first word of the next line it
> sees (ignoring the case where the word is at the end of a split)?

If you speak of the LineRecordReader, each map() call simply receives one
line, i.e. the text up to the next \n. It is not language-aware, so it does
not understand the meaning of hyphens, etc.

You can, however, implement a custom reader to do this. There should be
no problems as long as your logic ensures that no record is read twice
across multiple maps.
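
As an illustration of the "no duplicate reads" requirement, here is a
simplified, Hadoop-free sketch of one boundary convention a line-oriented
reader can follow (to my understanding LineRecordReader works along these
lines, but treat that as an assumption). SplitLineReaderSketch and its
methods are hypothetical, and the text is treated as a String of
single-byte characters for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of per-split line reading with no duplicates and no gaps.
class SplitLineReaderSketch {

    /** Returns the lines owned by the split [start, end) of text. */
    static List<String> readSplit(String text, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Skip through the first newline at or after 'start': that
            // (possibly partial) line belongs to the previous split.
            int nl = text.indexOf('\n', pos);
            pos = (nl < 0) ? text.length() : nl + 1;
        }
        // Read every line that STARTS at or before 'end', even if it runs
        // past the end of the split.
        while (pos <= end && pos < text.length()) {
            int nl = text.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? text.length() : nl;
            lines.add(text.substring(pos, lineEnd));
            pos = lineEnd + 1;
        }
        return lines;
    }

    /** Reads the whole text split by split and concatenates the results. */
    static List<String> readAll(String text, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < text.length(); start += splitSize) {
            int end = Math.min(start + splitSize, text.length());
            all.addAll(readSplit(text, start, end));
        }
        return all;
    }
}
```

Whatever split size is chosen, readAll() should return each line of the
input exactly once and in order; a custom reader that merges hyphenated
lines would need to preserve that property.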

-- 
Harsh J