RE: question about file input format

2011-08-18 Thread Zhixuan Zhu
Thanks so much for your help! I'll study the sample code and see what I
should do. My mapper will actually invoke another shell process to read
in the file and do its job. I just need to get the input file names and
pass them to the separate process from my mapper. In that case I don't
need to read the file into memory, right? How should I implement the
next() function accordingly?

Thanks again,
Grace

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Wednesday, August 17, 2011 9:36 PM
To: common-dev@hadoop.apache.org
Subject: Re: question about file input format

Zhixuan,

You'll require two things here, as you've deduced correctly:

Under InputFormat:
- isSplitable - return false
- getRecordReader - a simple implementation that reads the whole
file's bytes into an array (or your own construct) and passes it
along (as part of next(), etc.).

For example, here's a simple record reader impl you can return
(untested, but you'll get the idea of reading whole files, and porting
to new API is easy as well): https://gist.github.com/1153161

P.S. Since you are reading whole files into memory, keep an eye on
memory usage (the above example has a 10 MB limit per file, for
example). You could easily run out of memory if you don't handle such
cases properly.
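For reference, the heart of such a reader's next() can be sketched in plain Java. The Hadoop wiring (the 0.20 RecordReader interface, isSplitable(), getRecordReader()) is left out so the sketch compiles on its own, and all names here are illustrative, not taken from the gist above:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;

// Sketch of the core of a whole-file record reader's next() method.
// The whole file is one record, so next() succeeds exactly once.
class WholeFileReadSketch {
    static final long MAX_BYTES = 10L * 1024 * 1024; // 10 MB cap, as in the gist

    private boolean processed = false;

    // Mirrors next(key, value): reads the entire file into one byte array,
    // or returns null once the single record has been consumed.
    byte[] next(File file) throws IOException {
        if (processed) {
            return null; // no more records: the file *is* the record
        }
        if (file.length() > MAX_BYTES) {
            throw new IOException("File too large to buffer: " + file.length());
        }
        processed = true;
        return Files.readAllBytes(file.toPath());
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("whole", ".txt");
        tmp.deleteOnExit();
        try (FileWriter w = new FileWriter(tmp)) {
            w.write("entire file as one record");
        }
        WholeFileReadSketch reader = new WholeFileReadSketch();
        byte[] record = reader.next(tmp);             // first call: whole contents
        System.out.println(new String(record));       // prints the file's text
        System.out.println(reader.next(tmp) == null); // second call: no record
    }
}
```

In the real RecordReader you would copy these bytes into a BytesWritable value and return true/false instead of a byte array, but the read-once bookkeeping and the size guard are the same.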

On Thu, Aug 18, 2011 at 4:28 AM, Zhixuan Zhu z...@calpont.com wrote:
 I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some
 simple tasks. I'm trying to send each whole file of the input directory
 to the mapper without splitting them line by line. How should I set the
 input format class? I know I could derive a customized FileInputFormat
 class and override the isSplitable function. But I have no idea how to
 implement around the record reader. Any suggestion or sample code will
 be greatly appreciated.

 Thanks in advance,
 Grace




-- 
Harsh J


Re: question about file input format

2011-08-18 Thread Harsh J
Grace,

In that case you may simply set the key/value with dummy or nulls and
return true just once (same unread/read logic applies as in the
example). Then, using the input file name (via map.input.file or the
inputsplit), pass it to your spawned process and have it do the work.
You'll just be omitting the reading under next().
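Applied to Grace's case, the reader reduces to something like the plain-Java sketch below. Again the Hadoop interfaces are omitted so it stands alone; the spawned `wc -c` command is only a stand-in for the real shell tool, and in a real job the name would come from map.input.file or the InputSplit rather than being passed in:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Sketch of a next() that does no file I/O at all: it emits one dummy
// record per file so the mapper runs exactly once, and the mapper hands
// the file *name* to an external process instead of the file's contents.
class FileNameRecordSketch {
    private boolean done = false;

    // Mirrors next(key, value): sets the value once, then signals end of input.
    boolean next(StringBuilder value, String inputFileName) {
        if (done) {
            return false; // only one record per file
        }
        done = true;
        value.setLength(0);
        value.append(inputFileName); // or leave key/value as dummies/nulls
        return true;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        File tmp = File.createTempFile("input", ".txt");
        tmp.deleteOnExit();
        try (FileWriter w = new FileWriter(tmp)) {
            w.write("hello");
        }

        FileNameRecordSketch reader = new FileNameRecordSketch();
        StringBuilder value = new StringBuilder();
        while (reader.next(value, tmp.getAbsolutePath())) {
            // In a real mapper, this is where the spawned process gets the name.
            Process p = new ProcessBuilder("wc", "-c", value.toString())
                    .inheritIO().start();
            p.waitFor();
        }
    }
}
```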

On Thu, Aug 18, 2011 at 7:35 PM, Zhixuan Zhu z...@calpont.com wrote:
 Thanks so much for your help! I'll study the sample code and see what I
 should do. My mapper will actually invoke another shell process to read
 in the file and do its job. I just need to get the input file names and
 pass them to the separate process from my mapper. In that case I don't
 need to read the file into memory, right? How should I implement the
 next() function accordingly?

 Thanks again,
 Grace

 -----Original Message-----
 From: Harsh J [mailto:ha...@cloudera.com]
 Sent: Wednesday, August 17, 2011 9:36 PM
 To: common-dev@hadoop.apache.org
 Subject: Re: question about file input format

 Zhixuan,

 You'll require two things here, as you've deduced correctly:

 Under InputFormat:
 - isSplitable - return false
 - getRecordReader - a simple implementation that reads the whole
 file's bytes into an array (or your own construct) and passes it
 along (as part of next(), etc.).

 For example, here's a simple record reader impl you can return
 (untested, but you'll get the idea of reading whole files, and porting
 to new API is easy as well): https://gist.github.com/1153161

 P.s. Since you are reading whole files into memory, keep an eye out
 for memory usage (the above example has a 10 MB limit per file, for
 example). You could run out of memory easily if you don't handle the
 cases properly.

 On Thu, Aug 18, 2011 at 4:28 AM, Zhixuan Zhu z...@calpont.com wrote:
 I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some
 simple tasks. I'm trying to send each whole file of the input directory
 to the mapper without splitting them line by line. How should I set the
 input format class? I know I could derive a customized FileInputFormat
 class and override the isSplitable function. But I have no idea how to
 implement around the record reader. Any suggestion or sample code will
 be greatly appreciated.

 Thanks in advance,
 Grace




 --
 Harsh J




-- 
Harsh J


RE: question about file input format

2011-08-18 Thread Zhixuan Zhu

Thanks very much for the prompt reply! It makes perfect sense. I'll give
it a try.

Grace

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Thursday, August 18, 2011 10:03 AM
To: common-dev@hadoop.apache.org
Subject: Re: question about file input format

Grace,

In that case you may simply set the key/value with dummy or nulls and
return true just once (same unread/read logic applies as in the
example). Then, using the input file name (via map.input.file or the
inputsplit), pass it to your spawned process and have it do the work.
You'll just be omitting the reading under next().

On Thu, Aug 18, 2011 at 7:35 PM, Zhixuan Zhu z...@calpont.com wrote:
 Thanks so much for your help! I'll study the sample code and see what I
 should do. My mapper will actually invoke another shell process to read
 in the file and do its job. I just need to get the input file names and
 pass them to the separate process from my mapper. In that case I don't
 need to read the file into memory, right? How should I implement the
 next() function accordingly?

 Thanks again,
 Grace

 -----Original Message-----
 From: Harsh J [mailto:ha...@cloudera.com]
 Sent: Wednesday, August 17, 2011 9:36 PM
 To: common-dev@hadoop.apache.org
 Subject: Re: question about file input format

 Zhixuan,

 You'll require two things here, as you've deduced correctly:

 Under InputFormat:
 - isSplitable - return false
 - getRecordReader - a simple implementation that reads the whole
 file's bytes into an array (or your own construct) and passes it
 along (as part of next(), etc.).

 For example, here's a simple record reader impl you can return
 (untested, but you'll get the idea of reading whole files, and porting
 to new API is easy as well): https://gist.github.com/1153161

 P.s. Since you are reading whole files into memory, keep an eye out
 for memory usage (the above example has a 10 MB limit per file, for
 example). You could run out of memory easily if you don't handle the
 cases properly.

 On Thu, Aug 18, 2011 at 4:28 AM, Zhixuan Zhu z...@calpont.com wrote:
 I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some
 simple tasks. I'm trying to send each whole file of the input directory
 to the mapper without splitting them line by line. How should I set the
 input format class? I know I could derive a customized FileInputFormat
 class and override the isSplitable function. But I have no idea how to
 implement around the record reader. Any suggestion or sample code will
 be greatly appreciated.

 Thanks in advance,
 Grace




 --
 Harsh J




-- 
Harsh J


question about file input format

2011-08-17 Thread Zhixuan Zhu
I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some simple
tasks. I'm trying to send each whole file of the input directory to the
mapper without splitting them line by line. How should I set the input
format class? I know I could derive a customized FileInputFormat class
and override the isSplitable function. But I have no idea how to
implement around the record reader. Any suggestion or sample code will
be greatly appreciated.

Thanks in advance,
Grace


Re: question about file input format

2011-08-17 Thread Arun C Murthy
What file format do you want to use?

If it's Text or SequenceFile, or any other existing derivative of
FileInputFormat, just override isSplitable and rely on the actual RecordReader.

Arun

On Aug 17, 2011, at 3:58 PM, Zhixuan Zhu wrote:

 I'm new to Hadoop and currently using Hadoop 0.20.2 to try out some simple
 tasks. I'm trying to send each whole file of the input directory to the
 mapper without splitting them line by line. How should I set the input
 format class? I know I could derive a customized FileInputFormat class
 and override the isSplitable function. But I have no idea how to
 implement around the record reader. Any suggestion or sample code will
 be greatly appreciated.
 
 Thanks in advance,
 Grace
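Arun's suggestion, under the old 0.20 mapred API, amounts to a subclass like the sketch below. The TextInputFormat here is a minimal stand-in stub so the sketch compiles without Hadoop on the classpath; in real code you would delete the stub and extend org.apache.hadoop.mapred.TextInputFormat, whose isSplitable takes a FileSystem and a Path:

```java
// Minimal stand-in stub for org.apache.hadoop.mapred.TextInputFormat,
// with only the method this sketch cares about.
class TextInputFormat {
    protected boolean isSplitable(Object fs, Object file) {
        return true; // Hadoop's default: carve files into per-block splits
    }
}

// Refuse to split, but keep the inherited line-oriented RecordReader:
// each mapper then receives exactly one whole file as its input split.
class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(Object fs, Object file) {
        return false;
    }

    public static void main(String[] args) {
        System.out.println(new NonSplittableTextInputFormat().isSplitable(null, null));
    }
}
```

Note that with this approach the mapper still sees one record per line; it is only the split boundary that becomes the whole file. For one record per file, a custom RecordReader as in Harsh's gist is still needed.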