Thanks for the reply Peter, but that's not it. I use the comparator class to pass the -n flag, but the shuffling does not sort the keys numerically. Tell me if this is wrong:

1. Input (text file):
1324
212
123123
2332
145455
.....

2. The mapper job spawns a process that runs my Ruby code, passing each line via stdin. My script generates <key,value> pairs where key = value = line.

3. Hadoop sorts the keys before passing them to the reducer. It should sort them numerically because I pass the -n option to the comparator class.

4. The reducer feeds the lines into my reducer script, which behaves like the identity class.

From what I am seeing, everything works like this except that the sorting is not done numerically.

BTW, this is my latest command to submit the job:

hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
    -D mapred.text.key.comparator.options=-n \
    -input /input \
    -output /output \
    -mapper sort_mapper.rb \
    -file `pwd`/scripts_sort/sort_mapper.rb \
    -reducer sort_reducer.rb \
    -file `pwd`/scripts_sort/sort_reducer.rb
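For comparison, the "A Useful Comparator Class" example in the streaming docs that Peter links sets the comparator class itself, not only its options; a sketch of that invocation adapted to the paths above (untested here, and the assumption is that the -n option is only honored once KeyFieldBasedComparator is selected):

```shell
# Hypothetical variant of the command above, following the streaming docs'
# example: select KeyFieldBasedComparator explicitly, then pass it -n.
hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-n \
    -input /input \
    -output /output \
    -mapper sort_mapper.rb \
    -file `pwd`/scripts_sort/sort_mapper.rb \
    -reducer sort_reducer.rb \
    -file `pwd`/scripts_sort/sort_reducer.rb
```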
I know I could use the identity classes and get rid of the scripts. I tried that, but I get an exception (I'll deal with it once I've figured this out).

-drd

On Sat, May 16, 2009 at 11:42 PM, Peter Skomoroch <peter.skomor...@gmail.com> wrote:

> I just copy and pasted that comparator option from the docs, the -n part is
> what you want in this case.
>
> On Sun, May 17, 2009 at 12:40 AM, Peter Skomoroch <peter.skomor...@gmail.com> wrote:
>
> > 1) It is doing alphabetical sort by default, you can force Hadoop streaming
> > to sort numerically with:
> >
> > -D mapred.text.key.comparator.options=-k2,2nr \
> >
> > see the section "A Useful Comparator Class" in the streaming docs:
> > http://hadoop.apache.org/core/docs/current/streaming.html
> > and https://issues.apache.org/jira/browse/HADOOP-2302
> >
> > 2) For the second issue, I think you will need to use 1 reducer to
> > guarantee global sort order or use another MR pass.
> >
> > On Sun, May 17, 2009 at 12:14 AM, David Rio <driodei...@gmail.com> wrote:
> >
> > > BTW,
> > > Basically, this is the unix equivalent of what I am trying to do:
> > > $ cat input_file.txt | sort -n
> > > -drd
> > >
> > > On Sat, May 16, 2009 at 11:10 PM, David Rio <driodei...@gmail.com> wrote:
> > >
> > > > Hi,
> > > > I am trying to sort some data with hadoop (streaming mode).
> > > > The input looks like:
> > > > $ cat small_numbers.txt
> > > > 9971681
> > > > 9686036
> > > > 2592322
> > > > 4518219
> > > > 1467363
> > > >
> > > > To send my job to the cluster I use:
> > > > hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
> > > >     -D "mapred.reduce.tasks=2" \
> > > >     -D "stream.num.map.output.key.fields=1" \
> > > >     -D mapred.text.key.comparator.options=-k1,1n \
> > > >     -input /input \
> > > >     -output /output \
> > > >     -mapper sort_mapper.rb \
> > > >     -file `pwd`/scripts_sort/sort_mapper.rb \
> > > >     -reducer sort_reducer.rb \
> > > >     -file `pwd`/scripts_sort/sort_reducer.rb
> > > >
> > > > The mapper code basically writes key, value = input_line, input_line.
> > > > The reducer just prints the keys from the standard input.
> > > > In case you care:
> > > > $ cat scripts_sort/sort_*
> > > > #!/usr/bin/ruby
> > > >
> > > > STDIN.each_line { |l| puts "#{l.chomp}\t#{l.chomp}" }
> > > > ---------------------------------------------------------------------
> > > > #!/usr/bin/ruby
> > > >
> > > > STDIN.each_line { |line| puts line.split[0] }
> > > >
> > > > I run the job and it completes without problems. The output looks like:
> > > > d...@milhouse:~/tmp $ cat output/part-00001
> > > > 1380664
> > > > 1467363
> > > > 32485
> > > > 3857847
> > > > 422538
> > > > 4354952
> > > > 4518219
> > > > 5719091
> > > > 7838358
> > > > 9686036
> > > > d...@milhouse:~/tmp $ cat output/part-00000
> > > > 1453024
> > > > 2592322
> > > > 3875994
> > > > 4689583
> > > > 5340522
> > > > 607354
> > > > 6447778
> > > > 6535495
> > > > 8647464
> > > > 9971681
> > > >
> > > > These are my questions:
> > > > 1. It seems the sorting (per reducer) is working, but I don't know why, for
> > > > example, 607354 is not the first number in the output.
> > > >
> > > > 2. How can I tell hadoop to send data to the reducers in such a way that
> > > > inputReduce1keys < inputReduce2keys < ..... < inputReduceNkeys.
> > > > In that way I would ensure the data is fully sorted once the job is done.
> > > > I've also tried using the identity classes for the mapper and reducer, but
> > > > the job dies generating exceptions about the input format.
> > > > Can anyone show me or point me to some code showing how to properly
> > > > perform sorting?
> > > > Thanks in advance,
> > > > -drd
> >
> > --
> > Peter N. Skomoroch
> > 617.285.8348
> > http://www.datawrangling.com
> > http://delicious.com/pskomoroch
> > http://twitter.com/peteskomoroch
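The global-order requirement in question 2 can be sketched locally in Ruby: if keys are partitioned by range (every key sent to reducer i is smaller than every key sent to reducer i+1) and each reducer then sorts its keys numerically, concatenating the part files in order yields a fully sorted result. In Hadoop streaming this would need a custom partitioner or one of Peter's two options; the threshold below is made up purely for illustration.

```ruby
# Local sketch of range partitioning for a globally sorted output.
# Keys are taken from the thread's sample data; the split point is hypothetical.
keys = [9971681, 9686036, 2592322, 4518219, 1467363, 607354, 32485]

threshold = 5_000_000  # hypothetical split point between the two reducers
parts = keys.partition { |k| k < threshold }  # "reducer 0" and "reducer 1"
sorted_parts = parts.map(&:sort)              # each reducer sorts numerically

# Concatenating part-00000 and part-00001 now gives a fully sorted file.
global = sorted_parts.flatten
puts global.inspect
# → [32485, 607354, 1467363, 2592322, 4518219, 9686036, 9971681]
```

By contrast, with Hadoop's default hash partitioning the per-reducer outputs can each be sorted while their concatenation is not, which is exactly what the part-00000/part-00001 listings above show.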