On Mon, 16 Oct 2006 18:43:30 +0200, Dennis Kubes
<[EMAIL PROTECTED]> wrote:
> InputFormatBase is used by some of the other input formats such as
> SequenceFileInputFormat so changing it there will affect those other
> classes as well. I don't know if that is what you want or not. I would
> probably extend TextInputFormat (assuming the files are text logs
> such as apache logs and not xml files) and override
> areValidInputDirectories to check for files in the directories and
> getSplits to return splits with only the files that you want to process.

Thanks for your suggestions ... Here's how I did it:
* When configuring the job, individual files are added with
JobConf.addInputPath
* The input format is set to TextFileInputFormat, my own subclass of
TextInputFormat
* TextFileInputFormat overrides the following methods:
- areValidInputDirectories: This returns true even if one of the input
paths is a file
- listPaths: in InputFormatBase, this method simply returns a list of
all the files in the input directories. I overrode it to test whether
each input path is a directory or a file, and to add the input path
directly if it's a file.
This lets the input be a mix of files and directories, and it seems to
work like a charm.
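
A minimal sketch of the file-or-directory logic described above, written
against java.nio.file rather than the Hadoop classes so that it runs on its
own. The class name InputPathExpander and both method signatures are made up
for illustration; they are not the Hadoop API and not the actual
TextFileInputFormat code. The existence check in areValidInputs is also an
assumption, since the description above only says the method returns true.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the overridden methods; not a Hadoop class.
class InputPathExpander {

    // Accept files as well as directories. Assumption: an input is
    // considered valid as long as it exists.
    static boolean areValidInputs(List<Path> inputs) {
        for (Path input : inputs) {
            if (!Files.exists(input)) {
                return false;
            }
        }
        return true;
    }

    // Expand directories into the files they contain (non-recursively)
    // and pass plain files through unchanged.
    static List<Path> listPaths(List<Path> inputs) throws IOException {
        List<Path> result = new ArrayList<>();
        for (Path input : inputs) {
            if (Files.isDirectory(input)) {
                try (DirectoryStream<Path> children = Files.newDirectoryStream(input)) {
                    for (Path child : children) {
                        if (Files.isRegularFile(child)) {
                            result.add(child);
                        }
                    }
                }
            } else {
                // The input path is itself a file: add it directly.
                result.add(input);
            }
        }
        return result;
    }
}
```

In a real job, the same branch would sit inside the overridden listPaths and
use the Hadoop FileSystem calls instead of java.nio.file.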
[...]
--
Vetle Roeim
Team Manager, Information Systems
Opera Software ASA <URL: http://www.opera.com/ >