Well, in an actual job the input will be a file.
So, instead of:
echo "bla ble bli bla" | python mapper.py | sort -k1,1 | python reducer.py
you will have:
cat file.txt | python mapper.py | sort -k1,1 | python reducer.py
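(The cat is only there to feed the file into the pipeline; input
redirection does the same thing:
python mapper.py < file.txt | sort -k1,1 | python reducer.py)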
The file has to be on HDFS (keeping it simple; it can be on other
filesystems too), and mapper.py is the map task logic, which will be
executed on the data in "file.txt".
Depending on the size of "file.txt" (the number of HDFS blocks it
occupies), that many map tasks will run; for example, a 300 MB file
with a 128 MB block size spans three blocks, so three map tasks run.
The output of all the map tasks will go to the reducer for the final
output.
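The sort step between map and reduce matters here: Hadoop sorts the
map output by key before the reducer sees it, so all lines for the
same word arrive together. A reducer can then stream through its
input and emit a word as soon as the key changes, instead of keeping
every word in an in-memory dictionary. A minimal sketch of such a
sort-dependent reducer (same word count as the reducer.py below,
just relying on the sorted order):

import sys

current_word = None
current_count = 0

# Input lines look like "word<TAB>1" and arrive sorted by word,
# so all counts for one word are consecutive.
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        # Key changed: the previous word is finished, emit it.
        if current_word is not None:
            print current_word, current_count
        current_word = word
        current_count = int(count)

# Emit the last word.
if current_word is not None:
    print current_word, current_count

Drop it into the same local pipeline and it produces the same counts;
remove the "sort -k1,1" and it will print duplicate words, which is
exactly why the framework sorts.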
You can run the same program on Hadoop as a streaming job:

$ hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py \
    -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py \
    -input /file.txt -output /output
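Once the job finishes, you can inspect the result; with the command
above it lands in /output on HDFS (the exact part file names depend
on the number of reducers), e.g.:

$ hadoop fs -cat /output/part-00000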
The above is a very simple explanation; let me know if you have any
further questions.
On 23/9/17 3:40 am, Demian Kurejwowski wrote:
Hi, I am learning Hadoop and am currently doing the Python MapReduce
tutorial. I am trying to understand the difference between having a
map file and a reduce file.
I am assuming that when we launch the scripts, the mapper.py script
goes to all the machines at the same time and they all start printing
at the same time, and then the reducer reads the lines coming from
those jobs in no particular order?
1. Can I just do a script that gets the file, puts it in a temp file,
and then works with it? (I guess that defeats the whole purpose of
Hadoop, right?)
2. When working with a map script, do I always need to print as key,
value, or can I print whatever I want? And in what order does the
output come? If I read all the files of a folder like the tutorial
says, are they being read in sequential order by all the workers?
Can I make the mapper just print the lines of the file, and let the
reducer do the logic of what I want to accomplish?
Writing An Hadoop MapReduce Program In Python - Michael G. Noll
<http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/>
Following this tutorial, I found the way of getting the information
was writing it like this.
the mapper.py:

import sys

# Emit "word<TAB>1" for every word on every input line.
for i in sys.stdin:
    line = i.strip()
    words = line.split()
    for word in words:
        print word + "\t" + str(1)
the reducer.py:

import sys

# Sum the counts per word; keys may arrive in any order,
# so keep a running total in a dictionary.
dic_words = {}
for i in sys.stdin:
    line = i.strip()
    word, one_value = line.split("\t")
    word_value = dic_words.get(word, 0)
    dic_words[word] = word_value + int(one_value)

for key, value in dic_words.items():
    print key, str(value)
When I test it against a file it works, and just testing it locally
works too, with something easy:
echo "bla ble bli bla" | python mapper.py | sort -k1,1 | python reducer.py
and I do get:
bla 2
ble 1
bli 1
(Not sure why we need the sort; I guess that emulates how Hadoop
works? Maybe the Hadoop mappers run first and then they return a
dictionary that the reducer can read?)
Thanks guys, I know these are weird questions =(