Hi, I am learning Hadoop and currently doing a Python MapReduce tutorial. I am
trying to understand the difference between having a map file and a reduce file.
I am assuming that when we launch the scripts, the mapper.py script goes to all
the machines at the same time and they all start printing at the same time, and
then the reducer jobs read the lines coming from the mappers in no particular
order?
1. Can I just write a script that gets the file, puts it in a temp file, and
then works with it? (I guess this defeats the whole purpose of Hadoop, right?)
2. When working with a map script, do I always need to print key/value pairs,
or can I print whatever I want? And in what order does that come? If I read all
the files of a folder like the tutorial says, are they read in sequential order
by all the workers? Can I make the mapper just print the lines of the file, and
let the reducer do the logic of what I want to accomplish?
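To make the last part of question 2 concrete, here is a rough sketch of what I mean (the function names are just my own, and this is plain Python run locally, not through Hadoop): an identity mapper that only echoes lines, with all the counting logic moved into the reducer.

```python
from collections import Counter

def identity_map(lines):
    # The "mapper" does nothing: it just echoes each input line unchanged.
    # (In Streaming, a line with no tab is treated as a key with an empty value.)
    for line in lines:
        yield line.rstrip("\n")

def counting_reduce(lines):
    # All the logic lives in the reducer: split lines into words and count them.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

if __name__ == "__main__":
    demo = counting_reduce(identity_map(["bla ble bli bla\n"]))
    for word, n in sorted(demo.items()):
        print(word, n)
```

Is that a valid way to structure it, even if it throws away any parallelism on the map side?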



Writing An Hadoop MapReduce Program In Python - Michael G. Noll
By Michael G. Noll: How to write an Hadoop MapReduce program in Python with the Hadoop Streaming API

Following this tutorial, I found the way of getting the information was to
build a dictionary, like this.

mapper.py:

import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(word + "\t" + str(1))

reducer.py:

import sys

dic_words = {}
for line in sys.stdin:
    line = line.strip()
    word, one_value = line.split("\t")
    # add the value the mapper emitted instead of a hard-coded 1
    dic_words[word] = dic_words.get(word, 0) + int(one_value)

for key, value in dic_words.items():
    print(key, value)
It works when I test it against a file, and testing it locally works too.
Something easy:

echo "bla ble bli bla" | python mapper.py | sort -k1,1 | python reducer.py

and I get:

bla 2
ble 1
bli 1

(I'm not sure why we need the sort; I guess that emulates how Hadoop works?
Maybe Hadoop mappers run first and then return a dictionary that the reducer
can read?)
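Trying to answer my own sort question: my current understanding (please correct me if I'm wrong) is that Hadoop's shuffle phase sorts and groups the mapper output by key before the reducer ever sees it, so `sort -k1,1` is just a local stand-in for the shuffle. A toy simulation of the whole pipeline, with my own function names:

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) pairs, like mapper.py printing "word\t1".
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Stand-in for Hadoop's shuffle: sort mapper output by key so equal keys
    # arrive at the reducer as one contiguous run (what `sort -k1,1` emulates).
    return sorted(pairs, key=lambda kv: kv[0])

def reducer(pairs):
    # Because the input is sorted, the reducer can stream one group at a time
    # instead of holding every word in a dictionary at once.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(shuffle(mapper(["bla ble bli bla"]))))
print(counts)  # {'bla': 2, 'ble': 1, 'bli': 1}
```

If that's right, it would also explain why my reducer.py "works" without a sorted input only because it keeps everything in a dictionary.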
Thanks, guys. I know these are weird questions =(
