Thanks so much for your help! I'll study the sample code and see what I should do. My mapper will actually invoke a separate shell process that reads the file and does the work, so I only need to get the input file names and pass them to that process from my mapper. In that case I don't need to read the file into memory, right? How should I implement the next() function accordingly?

Thanks again,
Grace
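[For this case, one possible approach, sketched against the old 0.20.x API and untested: since the mapper only needs the path, next() can emit the split's path once and read no file bytes at all. FileNameInputFormat and FileNameRecordReader are illustrative names, not part of Hadoop.]

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hands each mapper exactly one record: the file's path as the key.
public class FileNameInputFormat extends FileInputFormat<Text, NullWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false; // one split per file
  }

  @Override
  public RecordReader<Text, NullWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new FileNameRecordReader((FileSplit) split);
  }
}

class FileNameRecordReader implements RecordReader<Text, NullWritable> {
  private final FileSplit split;
  private boolean done = false;

  FileNameRecordReader(FileSplit split) {
    this.split = split;
  }

  // Emits the file's full path once; no file bytes are read here.
  public boolean next(Text key, NullWritable value) throws IOException {
    if (done) {
      return false;
    }
    key.set(split.getPath().toString());
    done = true;
    return true;
  }

  public Text createKey() { return new Text(); }
  public NullWritable createValue() { return NullWritable.get(); }
  public long getPos() { return done ? 1 : 0; }
  public float getProgress() { return done ? 1.0f : 0.0f; }
  public void close() { }
}

[The mapper would then receive the path string as its key and could pass it on the command line of the external process, e.g. via ProcessBuilder; only that process ever opens the file.]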
-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Wednesday, August 17, 2011 9:36 PM
To: common-dev@hadoop.apache.org
Subject: Re: question about file input format

Zhixuan,

You'll require two things here, as you've deduced correctly. Under InputFormat:

- isSplitable -> false
- getRecordReader -> a simple implementation that reads the whole file's bytes into an array/your construct and passes it along (as part of next(), etc.).

For example, here's a simple record reader impl you can return (untested, but you'll get the idea of reading whole files, and porting it to the new API is easy as well): https://gist.github.com/1153161

P.S. Since you are reading whole files into memory, keep an eye on memory usage (the above example has a 10 MB limit per file, for example). You could easily run out of memory if you don't handle such cases properly.

On Thu, Aug 18, 2011 at 4:28 AM, Zhixuan Zhu <z...@calpont.com> wrote:
> I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some
> simple tasks. I'm trying to send each whole file of the input directory
> to the mapper without splitting it line by line. How should I set the
> input format class? I know I could derive a customized FileInputFormat
> class and override the isSplitable function, but I have no idea how to
> implement the record reader. Any suggestions or sample code would be
> greatly appreciated.
>
> Thanks in advance,
> Grace

--
Harsh J
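[For the whole-file case Harsh describes, a minimal untested sketch in the same spirit as the linked gist, again on the old 0.20.x API: next() buffers the entire file into a BytesWritable once, with a size cap as Harsh suggests. WholeFileRecordReader and the 10 MB limit are illustrative, not Hadoop API.]

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

// Emits one record per file: (path, whole file contents).
public class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
  private static final long MAX_BYTES = 10 * 1024 * 1024; // 10 MB safety cap

  private final FileSplit split;
  private final JobConf conf;
  private boolean processed = false;

  public WholeFileRecordReader(FileSplit split, JobConf conf) {
    this.split = split;
    this.conf = conf;
  }

  public boolean next(Text key, BytesWritable value) throws IOException {
    if (processed) {
      return false;
    }
    long len = split.getLength();
    if (len > MAX_BYTES) {
      // Guard against buffering arbitrarily large files in memory.
      throw new IOException("File too large to buffer in memory: " + len);
    }
    byte[] contents = new byte[(int) len];
    FileSystem fs = split.getPath().getFileSystem(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(split.getPath());
      IOUtils.readFully(in, contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    key.set(split.getPath().toString());
    value.set(contents, 0, contents.length);
    processed = true;
    return true;
  }

  public Text createKey() { return new Text(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return processed ? split.getLength() : 0; }
  public float getProgress() { return processed ? 1.0f : 0.0f; }
  public void close() { }
}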