Hello, I need to use Hadoop Streaming to run several instances of a single
program on different files. Before doing that, I wrote a simple test
application to use as the mapper; it just echoes its input without doing
anything useful. It looks like the following:

---------------------------echo.sh--------------------------
echo "Running mapper, input is $1"
---------------------------echo.sh--------------------------

For the input, I created a single text file, input.txt, containing the
numbers 1 through 10, one per line, so it goes like:

-----------input.txt---------------
1
2
..
10
-----------input.txt---------------
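
(For reference, I created and uploaded the file with commands along these
lines; I'm writing them from memory, so treat them as approximate:)

seq 1 10 > input.txt
bin/hadoop dfs -mkdir /stream
bin/hadoop dfs -put input.txt /stream/input.txt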

With input.txt uploaded to the hdfs://stream/ directory, I then ran the
Hadoop Streaming utility as follows:

bin/hadoop jar hadoop-0.18.0-streaming.jar \
-input /stream \
-output /trash \
-mapper echo.sh \
-file echo.sh \
-jobconf mapred.reduce.tasks=0
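
(When the job completed, I inspected the output with something like the
following; again, the exact command is from memory:)

bin/hadoop dfs -cat /trash/part-*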

From what I understood of the streaming tutorial, I expected each mapper to
run an instance of echo.sh with one of the lines of input.txt as its
argument, so I expected output of the form

Running mapper, input is 2
Running mapper, input is 5
...
and so on. Instead, I got only two output files, part-00000 and part-00001,
each containing just the string "Running mapper, input is ". As far as I can
tell, the mappers ran the mapper script echo.sh with nothing in $1. I
basically followed the tutorial, so I'm confused now; could you please tell
me what I'm missing here?
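
For what it's worth, my current guess is that Streaming feeds the input
records to the mapper over standard input rather than as command-line
arguments. If that's the case, I suppose the mapper would need to read
stdin, something like the sketch below (the name stdin-echo.sh is made up
and I haven't verified that this works):

---------------------stdin-echo.sh--------------------------
#!/bin/sh
# Read each input record from stdin instead of expecting it in $1
while read line; do
  echo "Running mapper, input is $line"
done
---------------------stdin-echo.sh--------------------------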

Thanks in advance,
Jim
