I just copied and pasted that comparator option from the docs; the -n flag is what you want in this case.
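One gotcha worth flagging: per the "A Useful Comparator Class" section of the streaming docs, the -k…n comparator options are only honored when the job's output key comparator class is also set to KeyFieldBasedComparator. A sketch of what the full invocation might look like for David's one-field keys (jar path and script names taken from his original command; assumes Hadoop 0.20 and a live cluster, so untested here):

```shell
# Sketch, not a verified run: numeric sort on the single-field key.
# The comparator class line is the piece missing from the original command;
# mapred.reduce.tasks=1 also gives globally sorted output (question 2).
hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
  -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapred.text.key.comparator.options=-k1,1n \
  -D stream.num.map.output.key.fields=1 \
  -D mapred.reduce.tasks=1 \
  -input /input \
  -output /output \
  -mapper sort_mapper.rb \
  -file `pwd`/scripts_sort/sort_mapper.rb \
  -reducer sort_reducer.rb \
  -file `pwd`/scripts_sort/sort_reducer.rb
```

With more than one reducer you would additionally need a partitioner that ranges the keys across reducers, which is why the single-reducer (or second-pass) suggestion below is the simple answer.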
On Sun, May 17, 2009 at 12:40 AM, Peter Skomoroch <peter.skomor...@gmail.com> wrote:
> 1) It is doing an alphabetical sort by default; you can force Hadoop streaming
> to sort numerically with:
>
>     -D mapred.text.key.comparator.options=-k2,2nr
>
> See the section "A Useful Comparator Class" in the streaming docs:
>
>     http://hadoop.apache.org/core/docs/current/streaming.html
>
> and https://issues.apache.org/jira/browse/HADOOP-2302
>
> 2) For the second issue, I think you will need to use 1 reducer to
> guarantee global sort order, or use another MR pass.
>
> On Sun, May 17, 2009 at 12:14 AM, David Rio <driodei...@gmail.com> wrote:
> >
> > BTW, basically this is the unix equivalent of what I am trying to do:
> >
> >     $ cat input_file.txt | sort -n
> >
> > -drd
> >
> > On Sat, May 16, 2009 at 11:10 PM, David Rio <driodei...@gmail.com> wrote:
> > >
> > > Hi,
> > > I am trying to sort some data with Hadoop (streaming mode). The input
> > > looks like:
> > >
> > >     $ cat small_numbers.txt
> > >     9971681
> > >     9686036
> > >     2592322
> > >     4518219
> > >     1467363
> > >
> > > To send my job to the cluster I use:
> > >
> > >     hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
> > >       -D "mapred.reduce.tasks=2" \
> > >       -D "stream.num.map.output.key.fields=1" \
> > >       -D mapred.text.key.comparator.options=-k1,1n \
> > >       -input /input \
> > >       -output /output \
> > >       -mapper sort_mapper.rb \
> > >       -file `pwd`/scripts_sort/sort_mapper.rb \
> > >       -reducer sort_reducer.rb \
> > >       -file `pwd`/scripts_sort/sort_reducer.rb
> > >
> > > The mapper basically writes key, value = input_line, input_line.
> > > The reducer just prints the keys from standard input.
> > > In case you care:
> > >
> > >     $ cat scripts_sort/sort_*
> > >     #!/usr/bin/ruby
> > >
> > >     STDIN.each_line { |l| puts "#{l.chomp}\t#{l.chomp}" }
> > >     ---------------------------------------------------------------------
> > >     #!/usr/bin/ruby
> > >
> > >     STDIN.each_line { |line| puts line.split[0] }
> > >
> > > I run the job and it completes without problems; the output looks like:
> > >
> > >     d...@milhouse:~/tmp $ cat output/part-00001
> > >     1380664
> > >     1467363
> > >     32485
> > >     3857847
> > >     422538
> > >     4354952
> > >     4518219
> > >     5719091
> > >     7838358
> > >     9686036
> > >     d...@milhouse:~/tmp $ cat output/part-00000
> > >     1453024
> > >     2592322
> > >     3875994
> > >     4689583
> > >     5340522
> > >     607354
> > >     6447778
> > >     6535495
> > >     8647464
> > >     9971681
> > >
> > > These are my questions:
> > >
> > > 1. It seems the sorting (per reducer) is working, but I don't know why, for
> > > example, 607354 is not the first number in the output.
> > >
> > > 2. How can I tell Hadoop to send data to the reducers in such a way that
> > > inputReduce1keys < inputReduce2keys < ... < inputReduceNkeys? That way I
> > > would ensure the data is fully sorted once the job is done.
> > >
> > > I have also tried using the identity classes for the mapper and reducer,
> > > but the job dies with exceptions about the input format.
> > > Can anyone show me, or point me to, some code showing how to properly
> > > perform sorting?
> > >
> > > Thanks in advance,
> > > -drd

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
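Incidentally, question 1 can be reproduced at the shell: the per-reducer output above is exactly what a byte-wise (string) sort produces, which is Hadoop's default for Text keys. Plain `sort` behaves the same way, and `sort -n` shows the numeric ordering the comparator option is meant to give (a few numbers from part-00000 above; `LC_ALL=C` pins the locale so the string comparison is deterministic):

```shell
# String sort compares byte-by-byte, so "607354" lands between
# "5340522" and "6447778" ('6' > '5', then '0' < '4').
printf '9971681\n2592322\n607354\n5340522\n6447778\n' | LC_ALL=C sort
# -> 2592322 5340522 607354 6447778 9971681

# Numeric sort is what -k1,1n asks the KeyFieldBasedComparator to do.
printf '9971681\n2592322\n607354\n5340522\n6447778\n' | LC_ALL=C sort -n
# -> 607354 2592322 5340522 6447778 9971681
```

This mirrors David's `cat input_file.txt | sort -n` one-liner: the comparator fix makes each reducer's input arrive in that numeric order.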