1) It is doing an alphabetical sort by default. You can force Hadoop streaming to sort numerically with:

-D mapred.text.key.comparator.options=-k2,2nr
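I haven't run this exact command, but modeled on the "A Useful Comparator Class" example in the streaming docs linked below, the full invocation for your one-field keys would look something like:

# KeyFieldBasedComparator is what makes the -k... options take effect;
# -k1,1n sorts on the first (and only) field numerically.
hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
  -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapred.text.key.comparator.options=-k1,1n \
  -D stream.num.map.output.key.fields=1 \
  -D mapred.reduce.tasks=1 \
  -input /input \
  -output /output \
  -mapper sort_mapper.rb \
  -file `pwd`/scripts_sort/sort_mapper.rb \
  -reducer sort_reducer.rb \
  -file `pwd`/scripts_sort/sort_reducer.rb

(I set mapred.reduce.tasks=1 here so the single output file comes out globally sorted; more on that in point 2.)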
See the section "A Useful Comparator Class" in the streaming docs: http://hadoop.apache.org/core/docs/current/streaming.html and https://issues.apache.org/jira/browse/HADOOP-2302

2) For the second issue, I think you will need to use 1 reducer to guarantee global sort order, or use another MR pass.

On Sun, May 17, 2009 at 12:14 AM, David Rio <driodei...@gmail.com> wrote:
>
> BTW,
> Basically, this is the unix equivalent of what I am trying to do:
>
> $ cat input_file.txt | sort -n
>
> -drd
>
> On Sat, May 16, 2009 at 11:10 PM, David Rio <driodei...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am trying to sort some data with hadoop (streaming mode). The input
> > looks like:
> >
> > $ cat small_numbers.txt
> > 9971681
> > 9686036
> > 2592322
> > 4518219
> > 1467363
> >
> > To send my job to the cluster I use:
> >
> > hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
> >   -D "mapred.reduce.tasks=2" \
> >   -D "stream.num.map.output.key.fields=1" \
> >   -D mapred.text.key.comparator.options=-k1,1n \
> >   -input /input \
> >   -output /output \
> >   -mapper sort_mapper.rb \
> >   -file `pwd`/scripts_sort/sort_mapper.rb \
> >   -reducer sort_reducer.rb \
> >   -file `pwd`/scripts_sort/sort_reducer.rb
> >
> > The mapper code basically writes key, value = input_line, input_line.
> > The reducer just prints the keys from standard input.
> > In case you care:
> >
> > $ cat scripts_sort/sort_*
> > #!/usr/bin/ruby
> >
> > STDIN.each_line {|l| puts "#{l.chomp}\t#{l.chomp}"}
> > ---------------------------------------------------------------------
> > #!/usr/bin/ruby
> >
> > STDIN.each_line { |line| puts line.split[0] }
> >
> > I run the job and it completes without problems. The output looks like:
> >
> > d...@milhouse:~/tmp $ cat output/part-00001
> > 1380664
> > 1467363
> > 32485
> > 3857847
> > 422538
> > 4354952
> > 4518219
> > 5719091
> > 7838358
> > 9686036
> >
> > d...@milhouse:~/tmp $ cat output/part-00000
> > 1453024
> > 2592322
> > 3875994
> > 4689583
> > 5340522
> > 607354
> > 6447778
> > 6535495
> > 8647464
> > 9971681
> >
> > These are my questions:
> >
> > 1. It seems the sorting (per reducer) is working, but I don't know why,
> > for example, 607354 is not the first number in the output.
> >
> > 2. How can I tell hadoop to send data to the reducers in such a way that
> > inputReduce1keys < inputReduce2keys < ..... < inputReduceNkeys? That way
> > I would ensure the data is fully sorted once the job is done.
> >
> > I've also tried using the identity classes for the mapper and reducer,
> > but the job dies generating exceptions about the input format.
> >
> > Can anyone show me or point me to some code showing how to properly
> > perform sorting?
> >
> > Thanks in advance,
> > -drd

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch