Take a look at the way that the text input format moves to the next line
after a split point.
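
To make that concrete, here is a rough sketch of the idea in plain Java (this
is not the Hadoop source; the class and method names are made up for the
example).  Each reader except the first skips ahead to the next line boundary
after its start, leaving whatever it skipped to the previous reader, and each
reader reads past its end to finish its last line, so every line is handled by
exactly one split and never cut in half.

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch only: illustrates the split-alignment trick, not Hadoop's actual
// LineRecordReader.
public class SplitAlignedLineReader {

    // Prints every line whose first byte falls inside this reader's byte
    // range (start, end], plus the first line of the file when start == 0.
    public static void readSplit(String path, long start, long end)
            throws IOException {
        RandomAccessFile in = new RandomAccessFile(path, "r");
        try {
            in.seek(start);
            if (start != 0) {
                // The line in progress at byte 'start' belongs to the
                // previous reader, so skip ahead to the next line boundary.
                in.readLine();
            }
            // Read whole lines; the last one may run past 'end', and the
            // next reader will skip those leftover bytes the same way.
            while (in.getFilePointer() <= end) {
                String line = in.readLine();
                if (line == null) {
                    break;   // end of file
                }
                System.out.println(line);
            }
        } finally {
            in.close();
        }
    }

    // Example: java SplitAlignedLineReader data.txt 0 1048576
    public static void main(String[] args) throws IOException {
        readSplit(args[0], Long.parseLong(args[1]), Long.parseLong(args[2]));
    }
}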

There are a couple of possible causes of your "input format not found"
problem.

First, is your input format class in a package?  If so, you need to give the
fully qualified name of the class.
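For example, if MyInputFormat were in a (hypothetical) package org.example,
you would have to pass -inputformat org.example.MyInputFormat rather than
just -inputformat MyInputFormat.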

Secondly, you have to tell streaming how to package up your input format
class for transfer to the cluster.  Having the class available on the machine
that launches the job is not sufficient.  At one point, it was necessary to
unpack the streaming.jar file and put your own classes and jars into it.
Last time I looked at the code, however, there was support for that happening
magically, but in the 30 seconds I have allotted to helping you (sorry about
that), I can't see a command line option to trigger it, unless it is the one
for including a file in the jar file.
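
For reference, the manual repackaging went roughly like this; the org.example
package and the file names are made up for the example, so adjust them to
your own layout, and whether 0.16.1 still needs any of it I can't say:

# run from the directory that contains org/example/MyInputFormat.class;
# jar uf updates the copy in place, which amounts to the same thing as
# unpacking and repacking the streaming jar
cp $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar my-streaming.jar
jar uf my-streaming.jar org/example/MyInputFormat.class
$HADOOP_HOME/bin/hadoop jar my-streaming.jar \
    -file ./allTools.sh -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 \
    -inputformat org.example.MyInputFormat -input test.txt -output test-output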


On 4/4/08 3:00 AM, "Francesco Tamberi" <[EMAIL PROTECTED]> wrote:

> Hi All,
> I have a streaming tool chain written in C++/Python that performs some
> operations on really big text files (on the order of gigabytes); the chain
> reads files and writes its results to standard output.
> The chain needs to read well-structured files, so I need to control how
> Hadoop splits files: it should split a file only at suitable places.
> What's the best way to do that?
> I'm trying to define a custom input format this way, but I'm not sure it's
> right:
> 
> public class MyInputFormat extends FileInputFormat<LongWritable, Text> {
>     ...
>
>     public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
>         ...
>     }
> }
> 
> That said, I tried to run it (on Hadoop 0.15.3, 0.16.0 and 0.16.1) with:
> 
> $HADOOP_HOME/bin/hadoop jar \
>     $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar \
>     -file ./allTools.sh -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 \
>     -file pathToMyClass.class -inputformat MyClass \
>     -input test.txt -output test-output
> 
> But it raises an exception: "-inputformat : class not found : MyClass".
> I tried passing a jar instead of the class file, putting them in
> HADOOP_CLASSPATH, and putting them in the system CLASSPATH, but I always get
> the same result.
> 
> Thank you for your patience!
> -- Francesco
