Hi,

I am trying to sort some data with Hadoop (streaming mode). The input looks like:

  $ cat small_numbers.txt
  9971681
  9686036
  2592322
  4518219
  1467363

To send my job to the cluster I use:

  hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
    -D "mapred.reduce.tasks=2" \
    -D "stream.num.map.output.key.fields=1" \
    -D mapred.text.key.comparator.options=-k1,1n \
    -input /input \
    -output /output \
    -mapper sort_mapper.rb \
    -file `pwd`/scripts_sort/sort_mapper.rb \
    -reducer sort_reducer.rb \
    -file `pwd`/scripts_sort/sort_reducer.rb

The mapper basically writes key, value = input_line, input_line. The reducer just prints the keys it reads from standard input. In case you care:

  $ cat scripts_sort/sort_*
  #!/usr/bin/ruby
  STDIN.each_line {|l| puts "#{l.chomp}\t#{l.chomp}"}
  ---------------------------------------------------------------------
  #!/usr/bin/ruby
  STDIN.each_line { |line| puts line.split[0] }

The job runs and completes without problems. The output looks like:

  d...@milhouse:~/tmp $ cat output/part-00001
  1380664
  1467363
  32485
  3857847
  422538
  4354952
  4518219
  5719091
  7838358
  9686036

  d...@milhouse:~/tmp $ cat output/part-00000
  1453024
  2592322
  3875994
  4689583
  5340522
  607354
  6447778
  6535495
  8647464
  9971681

These are my questions:

1. The sorting (per reducer) seems to be working, but I don't know why,
   for example, 607354 is not the first number in part-00000.

2. How can I tell Hadoop to send data to the reducers in such a way that
   inputReduce1keys < inputReduce2keys < ... < inputReduceNkeys?
   That way the data would be fully sorted once the job is done.

I've also tried using the identity classes for the mapper and reducer, but the job dies with exceptions about the input format.

Can anyone show me, or point me to, some code showing how to sort properly?

Thanks in advance,
-drd
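
A note on question 1: the order inside each part file matches a plain string (character-by-character) sort of the keys rather than a numeric one. This can be checked locally with a minimal Ruby sketch. It is not part of the job, and the variable names are illustrative only:

```ruby
# Keys copied from output/part-00000 above, in the order the job wrote them.
part_00000 = %w[1453024 2592322 3875994 4689583 5340522
                607354 6447778 6535495 8647464 9971681]

# Comparing the keys as strings reproduces that order: "607354" lands
# after "5340522" because "6" > "5", regardless of how many digits follow.
text_order = part_00000.sort

# Comparing them as integers gives the order a numeric sort would produce.
numeric_order = part_00000.sort_by { |key| key.to_i }

puts "text order matches job output: #{text_order == part_00000}"
puts "first key in numeric order:    #{numeric_order.first}"
```

This only shows what the observed ordering is consistent with. It does not explain why the comparator option passed to the job is not producing a numeric order.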