Re: A couple of Questions on InputFormat
Hi,

Yes, that is right.

On Mon, Sep 23, 2013 at 9:04 PM, Steve Lewis wrote:
> Thank you for your thorough answer.
>
> The last question is essentially this: while I can write a custom input
> format to handle things like hyphens, I could do almost the same thing in
> the mapper by saving any hyphenated word from the end of the previous line
> (ignoring hyphenated words that cross a split boundary), as long as
> LineRecordReader guarantees that each line in the split is sent to the
> same mapper in the order read.
> This seems to be the case - right?

--
Harsh J
Re: A couple of Questions on InputFormat
Thank you for your thorough answer.

The last question is essentially this: while I can write a custom input
format to handle things like hyphens, I could do almost the same thing in
the mapper by saving any hyphenated word from the end of the previous line
(ignoring hyphenated words that cross a split boundary), as long as
LineRecordReader guarantees that each line in the split is sent to the
same mapper in the order read.
This seems to be the case - right?

On Mon, Sep 23, 2013 at 4:30 AM, Harsh J wrote:
> If you speak of the LineRecordReader, each map() will simply read a
> line, i.e. until \n. It is not language-aware to understand meaning of
> hyphens, etc.
>
> You can implement a custom reader to do this however - there should be
> no problems so long as your logic covers the case of not having any
> duplicate reads across multiple maps.

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
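For readers following the thread, the mapper-side idea described above can be sketched roughly as follows. This is a stand-alone illustration that uses no Hadoop classes: HyphenJoiner, accept() and flush() are hypothetical names, and the sketch assumes that in a real job the fragment would be kept in a field of the Mapper, with accept() called from map() and flush() from cleanup(). It relies on the lines of one split arriving in file order, and it makes no attempt to repair a word that is cut by a split boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: joins a word that was hyphenated
// at the end of one line with the first token of the next line.
class HyphenJoiner {
    // Fragment carried over from a line that ended with a hyphen.
    private String pending = "";

    // Consumes one line and returns the complete words it yields.
    List<String> accept(String line) {
        List<String> words = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return words; // blank line: the pending fragment is kept as-is
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (i == 0 && !pending.isEmpty()) {
                token = pending + token; // complete the carried-over word
                pending = "";
            }
            boolean lastToken = (i == tokens.length - 1);
            if (lastToken && token.length() > 1 && token.endsWith("-")) {
                // Hold the fragment back until the next line arrives.
                pending = token.substring(0, token.length() - 1);
            } else {
                words.add(token);
            }
        }
        return words;
    }

    // Call once after the last line; emits a leftover fragment unchanged
    // (hyphen restored) because its continuation belongs to another split.
    List<String> flush() {
        List<String> words = new ArrayList<>();
        if (!pending.isEmpty()) {
            words.add(pending + "-");
            pending = "";
        }
        return words;
    }

    public static void main(String[] args) {
        HyphenJoiner joiner = new HyphenJoiner();
        System.out.println(joiner.accept("a word that was hyphen-"));
        System.out.println(joiner.accept("ated at the end of one line"));
        System.out.println(joiner.flush());
    }
}
```

Keeping the state in the mapper avoids writing a custom InputFormat, at the cost of leaving a word that straddles two splits in two pieces.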
Re: A couple of Questions on InputFormat
Hi,

(I'm assuming 1.0~ MR here)

On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis wrote:
> Classes implementing InputFormat implement
> public List<InputSplit> getSplits(JobContext job), which returns a List of
> InputSplits. For FileInputFormat the splits have a Path, a start and an end.
>
> 1) When is this method called, on which JVM on which machine, and is it
> called only once?

Called only at the client, i.e. your "hadoop jar" JVM. Called only once.

> 2) Does the number of map tasks correspond to the number of splits returned
> by getSplits?

Yes, number of split objects == number of mappers.

> 3) InputFormat implements a method
> RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext
> context). Is this executed within the JVM of the mapper on the slave
> machine, and does the RecordReader run within that JVM?

RecordReaders are not created on the client-side JVM. RecordReaders
are created in the map task JVMs and run inside them.

> 4) The default RecordReaders read a file from the start position to the end
> position, emitting values in the order read. With such a reader, assuming it
> is reading lines of text, is it reasonable to assume that the values the
> mapper receives are in the same order they were found in the file? Would it,
> for example, be possible for WordCount to see a word that was hyphen-
> ated at the end of one line and append the first word of the next line it
> sees (ignoring the case where the word is at the end of a split)?

If you speak of the LineRecordReader, each map() call will simply receive
one line, i.e. the text up to \n. It is not language-aware, so it does not
understand the meaning of hyphens, etc.

You can implement a custom reader to do this, however. There should be
no problems so long as your logic ensures that no record is read twice
across multiple maps.

--
Harsh J
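To illustrate the "no duplicate reads across multiple maps" condition, here is a rough, stand-alone simulation of the usual convention for line-oriented readers: the reader of a split that does not start at offset 0 discards its first (possibly partial) line, and every reader finishes the line that is in progress when it reaches the end of its split. This is only a simplified model, not Hadoop's LineRecordReader code. SplitLineSketch, readSplit() and nextLineStart() are hypothetical names, offsets are character positions in a String rather than byte positions in a file, and the exact boundary rule in Hadoop may differ by version.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of per-split line reading; no Hadoop classes involved.
class SplitLineSketch {
    // Returns the lines a reader for the split [start, start + length)
    // would emit under the convention described in the lead-in.
    static List<String> readSplit(String data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // The first line belongs to the previous split's reader.
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line is owned by this split if it begins at or before 'end'.
        while (pos <= end && pos < data.length()) {
            int next = nextLineStart(data, pos);
            boolean hasNewline = next > pos && data.charAt(next - 1) == '\n';
            lines.add(data.substring(pos, hasNewline ? next - 1 : next));
            pos = next; // may run past 'end' to finish the last line
        }
        return lines;
    }

    // Index just after the next '\n' at or after 'from', or the end of data.
    static int nextLineStart(String data, int from) {
        int newline = data.indexOf('\n', from);
        return newline < 0 ? data.length() : newline + 1;
    }

    public static void main(String[] args) {
        String data = "alpha\nbeta\ngamma\ndelta\n";
        int splitSize = 8; // arbitrary size that cuts through lines
        for (int start = 0; start < data.length(); start += splitSize) {
            int length = Math.min(splitSize, data.length() - start);
            System.out.println("split at " + start + ": "
                    + readSplit(data, start, length));
        }
    }
}
```

Under this rule each line is emitted by exactly one split regardless of where the boundaries fall. A custom reader that joins hyphenated lines would need an equivalent ownership rule for records that span two lines.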