Well, in an actual job the input will be a file.
So, instead of:
echo "bla ble bli bla" | python mapper.py | sort -k1,1 | python reducer.py
you will have:
cat file.txt | python mapper.py | sort -k1,1 | python reducer.py
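(The cat is only there to feed the file into the pipeline; input
redirection does the same thing:
python mapper.py < file.txt | sort -k1,1 | python reducer.py)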
The file has to be on HDFS (keeping it simple; it can be on other
filesystems too), and mapper.py is the map task logic, which will be
executed on the data in "file.txt".
Depending on the size of "file.txt" (the number of HDFS blocks it
occupies), that many map tasks will run; for example, a 300 MB file
with a 128 MB block size spans three blocks, so three map tasks run.
The output of all the map tasks will go to the reducer for the final
output.
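The sort step between map and reduce matters here: Hadoop sorts the
map output by key before the reducer sees it, so all lines for the
same word arrive together. A reducer can then stream through its
input and emit a word as soon as the key changes, instead of keeping
every word in an in-memory dictionary. A minimal sketch of such a
sort-dependent reducer (same word count as the reducer.py below,
just relying on the sorted order):

import sys

current_word = None
current_count = 0

# Input lines look like "word<TAB>1" and arrive sorted by word,
# so all counts for one word are consecutive.
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        # Key changed: the previous word is finished, emit it.
        if current_word is not None:
            print current_word, current_count
        current_word = word
        current_count = int(count)

# Emit the last word.
if current_word is not None:
    print current_word, current_count

Drop it into the same local pipeline and it produces the same counts;
remove the "sort -k1,1" and it will print duplicate words, which is
exactly why the framework sorts.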
You can run the same program on Hadoop as a streaming job:

$ hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py \
    -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py \
    -input /file.txt -output /output
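Once the job finishes, you can inspect the result; with the command
above it lands in /output on HDFS (the exact part file names depend
on the number of reducers), e.g.:

$ hadoop fs -cat /output/part-00000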
The above is a very simple explanation; let me know if you have any
further questions.
On 23/9/17 3:40 am, Demian Kurejwowski wrote:
Hi, I am learning Hadoop and am currently doing the Python MapReduce
tutorial. I am trying to understand the difference between having a
map file and a reduce file.
I am assuming that when we launch the scripts, the mapper.py script
goes to all the machines at the same time and they all start printing
at the same time, and then the reducer reads the lines coming from
those jobs in no particular order?
1. Can I just do a script that gets the file, puts it in a temp file,
and then works with it? (I guess that defeats the whole purpose of
Hadoop, right?)
2. When working with a map script, do I always need to print as key,
value, or can I print whatever I want? And in what order does the
output come? If I read all the files of a folder like the tutorial
says, are they being read in sequential order by all the workers?
Can I make the mapper just print the lines of the file, and let the
reducer do the logic of what I want to accomplish?
Writing An Hadoop MapReduce Program In Python - Michael G. Noll
<http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/>
Following this tutorial, I found the way of getting the information
was writing it like this.
the mapper.py:

import sys

# Emit "word<TAB>1" for every word on every input line.
for i in sys.stdin:
    line = i.strip()
    words = line.split()
    for word in words:
        print word + "\t" + str(1)
the reducer.py:

import sys

# Sum the counts per word; keys may arrive in any order,
# so keep a running total in a dictionary.
dic_words = {}
for i in sys.stdin:
    line = i.strip()
    word, one_value = line.split("\t")
    word_value = dic_words.get(word, 0)
    dic_words[word] = word_value + int(one_value)

for key, value in dic_words.items():
    print key, str(value)
When I test it against a file it works, and just testing it locally
works too, with something easy:
echo "bla ble bli bla" | python mapper.py | sort -k1,1 | python reducer.py
and I do get:
bla 2
ble 1
bli 1
(Not sure why we need the sort; I guess that emulates how Hadoop
works? Maybe the Hadoop mappers run first and then they return a
dictionary that the reducer can read?)
Thanks guys, I know these are weird questions =(