I just copy-and-pasted that comparator option from the docs; the -n part is
what you want in this case.
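A quick way to see what -n changes, using plain GNU sort on a few of the numbers from your output (no Hadoop involved):

```shell
# Without -n, keys compare as strings, so 1467363 sorts before 607354.
printf '607354\n1467363\n32485\n' | LC_ALL=C sort     # 1467363 32485 607354
# With -n, keys compare as numbers.
printf '607354\n1467363\n32485\n' | LC_ALL=C sort -n  # 32485 607354 1467363
```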

On Sun, May 17, 2009 at 12:40 AM, Peter Skomoroch <peter.skomor...@gmail.com> wrote:

> 1) It is doing an alphabetical sort by default; you can force Hadoop
> streaming to sort numerically with:
>
> -D mapred.text.key.comparator.options=-k2,2nr
>
> see the section "A Useful Comparator Class" in the streaming docs:
>
> http://hadoop.apache.org/core/docs/current/streaming.html
> and https://issues.apache.org/jira/browse/HADOOP-2302
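For reference, and worth double-checking against that doc page: as I recall, those comparator options only take effect when the key-field comparator class is also enabled, along the lines of:

```shell
# Assumption based on the "A Useful Comparator Class" section of the
# streaming docs: KeyFieldBasedComparator must be set for
# mapred.text.key.comparator.options to be honored.
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options=-k2,2nr \
```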
>
> 2) For the second issue, I think you will need to use 1 reducer to
> guarantee global sort order or use another MR pass.
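As a sketch of the second option: once each reducer's part file is numerically sorted, the outputs can be combined into one globally sorted stream with a merge rather than a full re-sort. Plain GNU sort is shown here as a stand-in for an extra MR pass; the file names and data are hypothetical:

```shell
# Stand-in part files, each already numerically sorted (hypothetical data).
printf '32485\n607354\n' > part-00000
printf '1467363\n9971681\n' > part-00001
# -m merges already-sorted inputs; -n keeps the numeric order.
sort -m -n part-00000 part-00001   # 32485 607354 1467363 9971681
```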
>
>
>
> On Sun, May 17, 2009 at 12:14 AM, David Rio <driodei...@gmail.com> wrote:
> >
> > BTW,
> > Basically, this is the unix equivalent to what I am trying to do:
> > $ cat input_file.txt | sort -n
> > -drd
> >
> > On Sat, May 16, 2009 at 11:10 PM, David Rio <driodei...@gmail.com> wrote:
> >
> > > Hi,
> > > I am trying to sort some data with Hadoop (streaming mode). The input
> > > looks like:
> > >  $ cat small_numbers.txt
> > > 9971681
> > > 9686036
> > > 2592322
> > > 4518219
> > > 1467363
> > >
> > > To send my job to the cluster I use:
> > > hadoop jar \
> > > /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
> > > -D "mapred.reduce.tasks=2" \
> > > -D "stream.num.map.output.key.fields=1" \
> > > -D mapred.text.key.comparator.options=-k1,1n \
> > > -input /input \
> > > -output /output \
> > > -mapper sort_mapper.rb \
> > > -file `pwd`/scripts_sort/sort_mapper.rb \
> > > -reducer sort_reducer.rb \
> > > -file `pwd`/scripts_sort/sort_reducer.rb
> > >
> > > The mapper code basically writes key, value = input_line, input_line.
> > > The reducer just prints the keys from the standard input.
> > > In case you care:
> > >  $ cat scripts_sort/sort_*
> > > #!/usr/bin/ruby
> > > # mapper: emit each input line as both key and value, tab-separated
> > > STDIN.each_line { |l| puts "#{l.chomp}\t#{l.chomp}" }
> > > ---------------------------------------------------------------------
> > > #!/usr/bin/ruby
> > > # reducer: print only the key (the first tab-separated field)
> > > STDIN.each_line { |line| puts line.split[0] }
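The streaming job can also be sanity-checked locally by emulating Hadoop's map | shuffle-sort | reduce pipeline. In this sketch awk stands in for the two Ruby scripts so it runs anywhere, and the string-ordered result reproduces the problem raised in question 1 below:

```shell
# mapper: emit line as key<TAB>value; sort: Hadoop's default string-order
# shuffle; reducer: print the key field. awk stands in for the Ruby scripts.
printf '1467363\n607354\n32485\n' \
  | awk '{print $0 "\t" $0}' \
  | LC_ALL=C sort \
  | awk '{print $1}'
# prints 1467363, 32485, 607354 -- string order, not numeric
```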
> > > I run the job and it completes without problems; the output looks like:
> > > d...@milhouse:~/tmp $ cat output/part-00001
> > > 1380664
> > > 1467363
> > > 32485
> > > 3857847
> > > 422538
> > > 4354952
> > > 4518219
> > > 5719091
> > > 7838358
> > > 9686036
> > > d...@milhouse:~/tmp $ cat output/part-00000
> > > 1453024
> > > 2592322
> > > 3875994
> > > 4689583
> > > 5340522
> > > 607354
> > > 6447778
> > > 6535495
> > > 8647464
> > > 9971681
> > > These are my questions:
> > > 1. It seems the sorting (per reducer) is working, but I don't understand
> > > the order: for example, 607354 is not the first number in the output.
> > >
> > > 2. How can I tell Hadoop to send data to the reducers in such a way that
> > > inputReduce1keys < inputReduce2keys < ... < inputReduceNkeys? That way I
> > > would ensure the data is fully sorted once the job is done.
> > > I've also tried using the identity classes for the mapper and reducer,
> > > but the job dies with exceptions about the input format.
> > > Can anyone show me, or point me to, some code showing how to properly
> > > perform sorting?
> > > Thanks in advance,
> > > -drd
> > >
> > >
>
>
>
> --
> Peter N. Skomoroch
> 617.285.8348
> http://www.datawrangling.com
> http://delicious.com/pskomoroch
> http://twitter.com/peteskomoroch
>


