Re: isSplitable() problem
The current code guarantees that they will be received in order. There are some patches that are likely to go in soon that would allow the JVM itself to be reused. In those cases I believe that the mapper class would be recreated, so the only thing you would have to worry about would be static values that are updated while processing the data.

-- Bobby Evans

On 4/24/12 4:45 AM, "Dan Drew" wrote:

I have chosen to use Jay's suggestion as a quick workaround and am pleased to report that it seems to work well on small test inputs. My question now is: are the mappers guaranteed to receive the file's lines in order? Browsing the source suggests this is so, but I just want to make sure, as my understanding of Hadoop is insubstantial. Thank you for your patience in answering my questions.

On 23 April 2012 14:28, Harsh J wrote:
> Jay,
>
> On Mon, Apr 23, 2012 at 6:43 PM, JAX wrote:
> > Curious: Seems like you could aggregate the results in the mapper as a
> > local variable or list of strings --- is there a way to know that your
> > mapper has just read the LAST line of an input split?
>
> True. That can be one way to do it (unless aggregation of 'records' needs
> to happen live, and you don't wish to store it all in memory).
>
> > Is there a "cleanup" or "finalize" method in mappers that is run at the
> > end of a whole stream read to support these sorts of chunked, in-memory
> > map/r operations?
>
> Yes there is. See:
>
> Old API:
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Mapper.html
> (See Closeable's close())
>
> New API:
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html#cleanup(org.apache.hadoop.mapreduce.Mapper.Context)
>
> --
> Harsh J
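The static-values caveat above can be shown without Hadoop at all: with JVM reuse, the framework creates a fresh mapper *instance* per task, but static fields belong to the class and survive for the life of the reused JVM. A minimal plain-Java sketch (all names here are illustrative, not Hadoop's):

```java
import java.util.ArrayList;
import java.util.List;

public class JvmReuseDemo {

    static class WordMapper {
        // Instance state: reset when the framework creates a new mapper object.
        final List<String> buffer = new ArrayList<>();
        // Static state: persists across tasks that share a reused JVM.
        static int recordsSeenByThisJvm = 0;

        void map(String line) {
            buffer.add(line);
            recordsSeenByThisJvm++;
        }
    }

    public static void main(String[] args) {
        WordMapper task1 = new WordMapper();   // first task in this JVM
        task1.map("a");
        task1.map("b");

        WordMapper task2 = new WordMapper();   // reused JVM, new mapper object
        task2.map("c");

        // Instance state is fresh per task; static state leaked across tasks.
        System.out.println(task2.buffer.size());
        System.out.println(WordMapper.recordsSeenByThisJvm);
    }
}
```

The second task sees a clean buffer but an inflated static counter, which is exactly the kind of state Bobby suggests worrying about.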
Re: isSplitable() problem
I have chosen to use Jay's suggestion as a quick workaround and am pleased to report that it seems to work well on small test inputs. My question now is: are the mappers guaranteed to receive the file's lines in order? Browsing the source suggests this is so, but I just want to make sure, as my understanding of Hadoop is insubstantial. Thank you for your patience in answering my questions.

On 23 April 2012 14:28, Harsh J wrote:
> Jay,
>
> On Mon, Apr 23, 2012 at 6:43 PM, JAX wrote:
> > Curious: Seems like you could aggregate the results in the mapper as a
> > local variable or list of strings --- is there a way to know that your
> > mapper has just read the LAST line of an input split?
>
> True. That can be one way to do it (unless aggregation of 'records' needs
> to happen live, and you don't wish to store it all in memory).
>
> > Is there a "cleanup" or "finalize" method in mappers that is run at the
> > end of a whole stream read to support these sorts of chunked, in-memory
> > map/r operations?
>
> Yes there is. See:
>
> Old API:
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Mapper.html
> (See Closeable's close())
>
> New API:
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html#cleanup(org.apache.hadoop.mapreduce.Mapper.Context)
>
> --
> Harsh J
Re: isSplitable() problem
Jay,

On Mon, Apr 23, 2012 at 6:43 PM, JAX wrote:
> Curious: Seems like you could aggregate the results in the mapper as a local
> variable or list of strings --- is there a way to know that your mapper has
> just read the LAST line of an input split?

True. That can be one way to do it (unless aggregation of 'records' needs to happen live, and you don't wish to store it all in memory).

> Is there a "cleanup" or "finalize" method in mappers that is run at the end
> of a whole stream read to support these sorts of chunked, in-memory map/r
> operations?

Yes there is. See:

Old API:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Mapper.html
(See Closeable's close())

New API:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html#cleanup(org.apache.hadoop.mapreduce.Mapper.Context)

--
Harsh J
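The chunked, in-memory pattern Harsh confirms here has a simple shape: accumulate in map(), emit once in cleanup(). A self-contained sketch modelled without the Hadoop classes (in a real job the same two methods go on a subclass of org.apache.hadoop.mapreduce.Mapper; the class and method names below are illustrative stand-ins):

```java
import java.util.ArrayList;
import java.util.List;

public class AggregatingMapper {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> output = new ArrayList<>();

    // Called once per record, like Mapper.map(): just accumulate in memory.
    public void map(String line) {
        buffer.add(line);
    }

    // Called once after the last record of the split, like Mapper.cleanup():
    // emit a single aggregated value for the whole split.
    public void cleanup() {
        output.add(String.join("\n", buffer));
    }

    public List<String> getOutput() {
        return output;
    }

    public static void main(String[] args) {
        AggregatingMapper m = new AggregatingMapper();
        m.map("first line");
        m.map("second line");
        m.cleanup();
        // One record out, containing everything the mapper read.
        System.out.println(m.getOutput().size());
    }
}
```

As Harsh notes, this only works when the whole split fits in memory; otherwise a custom RecordReader is the cleaner route.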
Re: isSplitable() problem
Curious: Seems like you could aggregate the results in the mapper as a local variable or list of strings --- is there a way to know that your mapper has just read the LAST line of an input split? I.e., if so, then you could implement your entire solution in your mapper without needing a new input format?

Is there a "cleanup" or "finalize" method in mappers that is run at the end of a whole stream read to support these sorts of chunked, in-memory map/r operations?

Jay Vyas
MMSB
UCHC

On Apr 23, 2012, at 6:40 AM, Dan Drew wrote:
> I require each input file to be processed by each mapper as a whole.
>
> I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override
> isSplitable() to invariably return false.
>
> The job is configured to use this subclass as the input format class via
> setInputFormatClass(). The job runs without error, yet the logs reveal
> files are still processed line by line by the mappers.
>
> Any help would be greatly appreciated,
> Thanks
Re: isSplitable() problem
Thanks for the clarification.

On 23 April 2012 12:52, Harsh J wrote:
> Dan,
>
> Splitting and reading a whole file as a chunk are two slightly different
> things. The former controls whether your files ought to be split across
> mappers (useful if there are multiple blocks of a file in HDFS). The
> latter needs to be achieved differently.
>
> TextInputFormat provides by default a LineRecordReader, which, as its
> name suggests, reads whatever stream is provided to it line by line.
> This is regardless of the file's block splits (a very different thing
> than line splits).
>
> You need to implement your own "RecordReader" and return it from your
> InputFormat to do what you want it to - i.e. read the whole stream
> into an object and then pass it out to the Mapper.
>
> On Mon, Apr 23, 2012 at 5:10 PM, Dan Drew wrote:
> > I require each input file to be processed by each mapper as a whole.
> >
> > I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override
> > isSplitable() to invariably return false.
> >
> > The job is configured to use this subclass as the input format class via
> > setInputFormatClass(). The job runs without error, yet the logs reveal
> > files are still processed line by line by the mappers.
> >
> > Any help would be greatly appreciated,
> > Thanks
>
> --
> Harsh J
Re: isSplitable() problem
Dan,

Splitting and reading a whole file as a chunk are two slightly different things. The former controls whether your files ought to be split across mappers (useful if there are multiple blocks of a file in HDFS). The latter needs to be achieved differently.

TextInputFormat provides by default a LineRecordReader, which, as its name suggests, reads whatever stream is provided to it line by line. This is regardless of the file's block splits (a very different thing than line splits).

You need to implement your own "RecordReader" and return it from your InputFormat to do what you want it to - i.e. read the whole stream into an object and then pass it out to the Mapper.

On Mon, Apr 23, 2012 at 5:10 PM, Dan Drew wrote:
> I require each input file to be processed by each mapper as a whole.
>
> I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override
> isSplitable() to invariably return false.
>
> The job is configured to use this subclass as the input format class via
> setInputFormatClass(). The job runs without error, yet the logs reveal
> files are still processed line by line by the mappers.
>
> Any help would be greatly appreciated,
> Thanks

--
Harsh J
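The heart of the custom RecordReader Harsh describes is just "drain the whole stream into one value" instead of carving it on newlines. Sketched here against a plain InputStream so it runs standalone; inside Hadoop the stream would come from FileSystem.open() on the split's path, and the result would become the reader's single key/value pair (the method name is illustrative, not a Hadoop API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class WholeStreamReader {

    // Read everything up to EOF into one String; no line-by-line
    // interpretation at all, unlike LineRecordReader.
    public static String readWholeStream(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toString(StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws IOException {
        InputStream fake = new ByteArrayInputStream(
                "line 1\nline 2\nline 3\n".getBytes(StandardCharsets.UTF_8));
        // One call, one record: the mapper would see all three lines at once.
        System.out.println(readWholeStream(fake));
    }
}
```

Such a reader returns true from nextKeyValue() exactly once per split, which is what makes the mapper see the file as a whole.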
isSplitable() problem
I require each input file to be processed by each mapper as a whole.

I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override isSplitable() to invariably return false.

The job is configured to use this subclass as the input format class via setInputFormatClass(). The job runs without error, yet the logs reveal files are still processed line by line by the mappers.

Any help would be greatly appreciated,
Thanks
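The symptom described above can be modelled in a few lines of plain Java: isSplitable() only decides how many splits a file becomes, while the record reader still decides how a split is carved into records. All names below are illustrative stand-ins for the Hadoop classes, not the real API:

```java
import java.util.Arrays;
import java.util.List;

public class SplitVsRecordDemo {

    static class WholeFileTextFormat {
        // The override from the question: never split the file...
        boolean isSplitable() {
            return false;
        }

        // ...but the inherited line reader still splits the one big
        // split into per-line records, as LineRecordReader does.
        List<String> readRecords(String fileContents) {
            return Arrays.asList(fileContents.split("\n"));
        }
    }

    public static void main(String[] args) {
        WholeFileTextFormat fmt = new WholeFileTextFormat();
        String file = "a\nb\nc";
        // One split, yet three records: exactly the symptom in the logs.
        System.out.println(fmt.isSplitable());
        System.out.println(fmt.readRecords(file).size());
    }
}
```

Overriding only the first method leaves the second untouched, which is why the logs still show line-by-line processing.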