Thanks so much for your help! I'll study the sample code and see what I
should do. My mapper will actually invoke another shell process to read
in the file and do its job, so I just need to get the input file names
and pass them to that separate process from my mapper. In that case I
don't need to read the file into memory, right? How should I implement
the next() function accordingly?
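
Something like this rough sketch is what I'm picturing for next(),
borrowing the split/done fields from your gist (untested, and the
names are just my guesses):

// Rough idea: emit one (file path, nothing) record per file and let
// the external shell process launched by the mapper read the file
// itself, so no file contents are buffered in the record reader.
public boolean next(Text key, NullWritable value) throws IOException {
  if (done) {
    return false; // exactly one record per (unsplit) file
  }
  key.set(split.getPath().toString()); // pass only the path, no bytes
  done = true;
  return true;
}

Is that roughly right?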

Thanks again,
Grace

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Wednesday, August 17, 2011 9:36 PM
To: common-dev@hadoop.apache.org
Subject: Re: question about file input format

Zhixuan,

You'll require two things here, as you've deduced correctly:

Under your FileInputFormat subclass:
- isSplitable -> return false
- getRecordReader -> return a simple implementation that reads the
whole file's bytes into an array/your-construct and passes them along
(as part of next(), etc.).

For example, here's a simple record reader impl you can return
(untested, but you'll get the idea of reading whole files, and porting
it to the new API is easy as well): https://gist.github.com/1153161
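
In case that link ever goes stale, the reader boils down to something
like this (again untested; old 0.20 mapred API, and the class/field
names here are mine):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

// Untested sketch: one (file path, whole file bytes) record per split.
// This only works because isSplitable() returns false, so each split
// covers one complete file.
public class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
  private static final long MAX_BYTES = 10 * 1024 * 1024; // guard against huge files

  private final FileSplit split;
  private final Configuration conf;
  private boolean done = false;

  public WholeFileRecordReader(FileSplit split, Configuration conf) {
    this.split = split;
    this.conf = conf;
  }

  public boolean next(Text key, BytesWritable value) throws IOException {
    if (done) {
      return false; // we emit exactly one record
    }
    Path file = split.getPath();
    if (split.getLength() > MAX_BYTES) {
      throw new IOException("File too large to buffer in memory: " + file);
    }
    byte[] contents = new byte[(int) split.getLength()];
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = fs.open(file);
    try {
      IOUtils.readFully(in, contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    key.set(file.toString());
    value.set(contents, 0, contents.length);
    done = true;
    return true;
  }

  public Text createKey() { return new Text(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return done ? split.getLength() : 0; }
  public float getProgress() { return done ? 1.0f : 0.0f; }
  public void close() { }
}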

P.s. Since you are reading whole files into memory, keep an eye on
memory usage (the above example caps it at 10 MB per file, for
instance). You could easily run out of memory if you don't handle such
cases properly.
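
And for completeness, the input format that returns such a reader could
look like this (same caveats apply, untested):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false; // hand each whole file to a single mapper
  }

  @Override
  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // JobConf extends Configuration, so it can be passed straight through.
    return new WholeFileRecordReader((FileSplit) split, job);
  }
}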

On Thu, Aug 18, 2011 at 4:28 AM, Zhixuan Zhu <z...@calpont.com> wrote:
> I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some
> simple tasks. I'm trying to send each whole file of the input
> directory to the mapper without splitting them line by line. How
> should I set the input format class? I know I could derive a
> customized FileInputFormat class and override the isSplitable
> function. But I have no idea how to implement the record reader. Any
> suggestion or sample code will be greatly appreciated.
>
> Thanks in advance,
> Grace
>



-- 
Harsh J
