RE: question about file input format
Thanks so much for your help! I'll study the sample code and see what I should do. My mapper will actually invoke another shell process to read the file and do its job, so I just need to get the input file names and pass them to the separate process from my mapper. In that case I don't need to read the file into memory, right? How should I implement the next() function accordingly?

Thanks again,
Grace

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Wednesday, August 17, 2011 9:36 PM
To: common-dev@hadoop.apache.org
Subject: Re: question about file input format

Zhixuan,

You'll require two things here, as you've deduced correctly. Under InputFormat:

- isSplitable - false
- getRecordReader - a simple implementation that reads the whole file's bytes into an array (or your own construct) and passes it along (as part of next(), etc.)

For example, here's a simple record reader implementation you can return (untested, but you'll get the idea of reading whole files, and porting it to the new API is easy as well): https://gist.github.com/1153161

P.S. Since you are reading whole files into memory, keep an eye on memory usage (the example above has a 10 MB limit per file, for instance). You could easily run out of memory if you don't handle such cases properly.

On Thu, Aug 18, 2011 at 4:28 AM, Zhixuan Zhu <z...@calpont.com> wrote:
> I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some
> simple tasks. I'm trying to send each whole file in the input directory
> to the mapper without splitting it line by line. How should I set the
> input format class? I know I could derive a customized FileInputFormat
> class and override the isSplitable function, but I have no idea how to
> implement the record reader. Any suggestion or sample code will be
> greatly appreciated.
>
> Thanks in advance,
> Grace

--
Harsh J
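For readers following the thread, a minimal sketch of the whole-file record reader Harsh describes, written against the old (org.apache.hadoop.mapred) API, might look like the following. This is an untested outline based on the description above, not the linked gist itself; the class and field names are my own, and the size cap mirrors the 10 MB limit Harsh mentions. It would be returned from getRecordReader() in a FileInputFormat subclass whose isSplitable() returns false.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class WholeFileRecordReader
    implements RecordReader<NullWritable, BytesWritable> {

  private static final long MAX_BYTES = 10 * 1024 * 1024; // 10 MB per-file cap

  private final FileSplit split;
  private final JobConf conf;
  private boolean read = false; // the unread/read toggle: one record per file

  public WholeFileRecordReader(FileSplit split, JobConf conf) {
    this.split = split;
    this.conf = conf;
  }

  @Override
  public boolean next(NullWritable key, BytesWritable value) throws IOException {
    if (read) {
      return false; // the single whole-file record was already emitted
    }
    if (split.getLength() > MAX_BYTES) {
      throw new IOException("File too large to buffer: " + split.getPath());
    }
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    byte[] contents = new byte[(int) split.getLength()];
    FSDataInputStream in = fs.open(file);
    try {
      IOUtils.readFully(in, contents, 0, contents.length); // slurp the whole file
    } finally {
      IOUtils.closeStream(in);
    }
    value.set(contents, 0, contents.length);
    read = true;
    return true;
  }

  @Override public NullWritable createKey() { return NullWritable.get(); }
  @Override public BytesWritable createValue() { return new BytesWritable(); }
  @Override public long getPos() throws IOException { return read ? split.getLength() : 0; }
  @Override public float getProgress() { return read ? 1.0f : 0.0f; }
  @Override public void close() { }
}
```

The whole file arrives in memory as one BytesWritable value per map() call, which is why the size cap matters.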
Re: question about file input format
Grace,

In that case you may simply set the key/value with dummies or nulls and return true just once (the same unread/read logic applies as in the example). Then, using the input file name (via map.input.file or the InputSplit), pass it to your spawned process and have it do the work. You'll just be omitting the reading under next().

On Thu, Aug 18, 2011 at 7:35 PM, Zhixuan Zhu <z...@calpont.com> wrote:
> Thanks so much for your help! I'll study the sample code and see what
> I should do. My mapper will actually invoke another shell process to
> read the file and do its job. [...]

--
Harsh J
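A sketch of the variant Harsh describes here, again untested and with names of my own choosing: the reader emits the file's path exactly once and never touches the file's bytes, leaving the actual reading to the process the mapper spawns.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class FileNameRecordReader implements RecordReader<NullWritable, Text> {

  private final FileSplit split;
  private boolean read = false; // same unread/read toggle as the whole-file example

  public FileNameRecordReader(FileSplit split) {
    this.split = split;
  }

  @Override
  public boolean next(NullWritable key, Text value) throws IOException {
    if (read) {
      return false; // a single record per file
    }
    value.set(split.getPath().toString()); // pass the path, not the contents
    read = true;
    return true;
  }

  @Override public NullWritable createKey() { return NullWritable.get(); }
  @Override public Text createValue() { return new Text(); }
  @Override public long getPos() { return 0; }
  @Override public float getProgress() { return read ? 1.0f : 0.0f; }
  @Override public void close() { }
}
```

The mapper can then hand the value to an external tool, e.g. `new ProcessBuilder("/path/to/your-tool", value.toString()).start()` (the tool path here is a placeholder); alternatively it can read the map.input.file property from the JobConf in configure(), as Harsh notes.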
RE: question about file input format
Thanks very much for the prompt reply! It makes perfect sense. I'll give it a try.

Grace

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Thursday, August 18, 2011 10:03 AM
To: common-dev@hadoop.apache.org
Subject: Re: question about file input format

Grace,

In that case you may simply set the key/value with dummies or nulls and return true just once (the same unread/read logic applies as in the example). [...]

--
Harsh J
question about file input format
I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some simple tasks. I'm trying to send each whole file in the input directory to the mapper without splitting it line by line. How should I set the input format class? I know I could derive a customized FileInputFormat class and override the isSplitable function, but I have no idea how to implement the record reader. Any suggestion or sample code will be greatly appreciated.

Thanks in advance,
Grace
Re: question about file input format
What file format do you want to use? If it's Text or SequenceFile, or any other existing derivative of FileInputFormat, just override isSplitable and rely on the existing RecordReader.

Arun

On Aug 17, 2011, at 3:58 PM, Zhixuan Zhu wrote:
> I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some
> simple tasks. I'm trying to send each whole file in the input directory
> to the mapper without splitting it line by line. [...]
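Arun's suggestion amounts to a very small subclass. A sketch against the 0.20 (mapred) API, assuming text input; the class name is my own:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Reuses TextInputFormat's existing LineRecordReader untouched; only the
// splitting is disabled, so each map task receives one whole file (still
// delivered to map() line by line).
public class NonSplittingTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false; // never split: one InputSplit per input file
  }
}
```

It would then be wired in with `conf.setInputFormat(NonSplittingTextInputFormat.class)` on the JobConf.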