BTW, this is basically the Unix equivalent of what I am trying to do:

$ cat input_file.txt | sort -n

-drd
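
P.S. To make the difference concrete, here is a minimal Ruby sketch (not part of the job itself) that sorts a few of the keys from the part-00001 output quoted below, first as text and then as numbers. The per-reducer output I am getting matches the text ordering, whereas `sort -n` gives the numeric one. The variable names are only for illustration.

```ruby
# Sample keys copied from the part-00001 output in the quoted message.
keys = %w[1380664 1467363 32485 3857847 422538]

# Comparing the keys as strings gives byte-wise (lexicographic) order,
# so "32485" lands after "1467363" because "3" > "1".
lexicographic = keys.sort

# Comparing the keys as integers gives the order `sort -n` would produce.
numeric = keys.sort_by { |k| Integer(k) }

puts "as text:    #{lexicographic.join(' ')}"
puts "as numbers: #{numeric.join(' ')}"
```
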
On Sat, May 16, 2009 at 11:10 PM, David Rio <driodei...@gmail.com> wrote:
> Hi,
>
> I am trying to sort some data with Hadoop (streaming mode). The input looks
> like:
>
> $ cat small_numbers.txt
> 9971681
> 9686036
> 2592322
> 4518219
> 1467363
>
> To send my job to the cluster I use:
>
> hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
> -D "mapred.reduce.tasks=2" \
> -D "stream.num.map.output.key.fields=1" \
> -D mapred.text.key.comparator.options=-k1,1n \
> -input /input \
> -output /output \
> -mapper sort_mapper.rb \
> -file `pwd`/scripts_sort/sort_mapper.rb \
> -reducer sort_reducer.rb \
> -file `pwd`/scripts_sort/sort_reducer.rb
>
> The mapper code basically writes key, value = input_line, input_line.
> The reducer just prints the keys from the standard input.
> In case you care:
>
> $ cat scripts_sort/sort_*
> #!/usr/bin/ruby
>
> STDIN.each_line {|l| puts "#{l.chomp}\t#{l.chomp}"}
> ---------------------------------------------------------------------
> #!/usr/bin/ruby
>
> STDIN.each_line { |line| puts line.split[0] }
>
> I run the job and it completes without problems. The output looks like:
>
> d...@milhouse:~/tmp $ cat output/part-00001
> 1380664
> 1467363
> 32485
> 3857847
> 422538
> 4354952
> 4518219
> 5719091
> 7838358
> 9686036
> d...@milhouse:~/tmp $ cat output/part-00000
> 1453024
> 2592322
> 3875994
> 4689583
> 5340522
> 607354
> 6447778
> 6535495
> 8647464
> 9971681
>
> These are my questions:
>
> 1. It seems the sorting (per reducer) is working, but I don't know why,
> for example, 607354 is not the first number in the output.
>
> 2. How can I tell Hadoop to send data to the reducers in such a way that
> inputReduce1keys < inputReduce2keys < ..... < inputReduceNkeys? That way I
> would ensure the data is fully sorted once the job is done.
>
> I've also tried using the identity classes for the mapper and reducer, but
> the job dies, generating exceptions about the input format.
>
> Can anyone show me or point me to some code showing how to properly perform
> sorting.
>
> Thanks in advance,
> -drd
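
To spell out what I mean in question 2, here is a rough Ruby sketch of the behaviour I am after, run locally on the five sample numbers. It is only an illustration of range partitioning, not Hadoop code: `range_partition` and the split point of 5,000,000 are made up for the example, and I am assuming two reducers.

```ruby
# Hypothetical helper: choose a reducer index for a key, given sorted
# split points. With split points [s1, s2], reducer 0 gets keys < s1,
# reducer 1 gets s1 <= key < s2, and reducer 2 gets keys >= s2.
def range_partition(key, split_points)
  idx = split_points.index { |s| key < s }
  idx.nil? ? split_points.length : idx
end

# The sample input from small_numbers.txt.
keys = [9971681, 9686036, 2592322, 4518219, 1467363]

# Assumed cut point for 2 reducers (chosen by hand for this example).
split_points = [5_000_000]

# Route every key to its reducer's bucket.
buckets = Hash.new { |h, i| h[i] = [] }
keys.each { |k| buckets[range_partition(k, split_points)] << k }

# Each "reducer" sorts its own bucket numerically.
parts = (0..split_points.length).map { |i| buckets[i].sort }

parts.each_with_index do |part, i|
  puts "part-#{format('%05d', i)}: #{part.join(' ')}"
end
```

Because every key in part-00000 is smaller than every key in part-00001, concatenating the parts in order gives the same result as `sort -n` on the whole input.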