BTW, this is basically the Unix equivalent of what I am trying to do:
$ cat input_file.txt | sort -n
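
To make the difference concrete, here is a minimal Ruby sketch (standalone, not part of the job) that contrasts string ordering with the numeric ordering `sort -n` produces. The values are copied from the part-00000 listing quoted below. That the job compares keys as plain text is only an assumption, based on the order seen in that file.

```ruby
# Keys copied from the part-00000 listing, in the order the job wrote them.
keys = %w[1453024 2592322 3875994 4689583 5340522 607354 6447778 6535495 8647464 9971681]

# Comparing the keys as strings (character by character) reproduces that order:
# "607354" lands after "5340522" because "6" > "5" in the first position.
as_strings = keys.sort

# Comparing the keys as integers is what `sort -n` does: 607354 comes first.
as_integers = keys.sort_by { |k| k.to_i }

puts "as strings:  #{as_strings.join(' ')}"
puts "as integers: #{as_integers.join(' ')}"
```

If the string ordering matches the part file and the integer ordering does not, the numeric comparator option is probably not taking effect.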
-drd

On Sat, May 16, 2009 at 11:10 PM, David Rio <driodei...@gmail.com> wrote:

> Hi,
> I am trying to sort some data with Hadoop (streaming mode). The input looks
> like:
>  $ cat small_numbers.txt
> 9971681
> 9686036
> 2592322
> 4518219
> 1467363
>
> To send my job to the cluster I use:
> hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
> -D "mapred.reduce.tasks=2" \
> -D "stream.num.map.output.key.fields=1" \
> -D mapred.text.key.comparator.options=-k1,1n \
> -input /input \
> -output /output \
> -mapper sort_mapper.rb \
> -file `pwd`/scripts_sort/sort_mapper.rb \
> -reducer sort_reducer.rb \
> -file `pwd`/scripts_sort/sort_reducer.rb
>
> The mapper code basically writes key, value = input_line, input_line.
> The reducer just prints the keys from the standard input.
> In case you care:
>  $ cat scripts_sort/sort_*
> #!/usr/bin/ruby
>
> STDIN.each_line {|l| puts "#{l.chomp}\t#{l.chomp}"}
> ---------------------------------------------------------------------
> #!/usr/bin/ruby
>
> STDIN.each_line { |line| puts line.split[0] }
>
> I run the job and it completes without problems. The output looks like:
> d...@milhouse:~/tmp $ cat output/part-00001
> 1380664
> 1467363
> 32485
> 3857847
> 422538
> 4354952
> 4518219
> 5719091
> 7838358
> 9686036
> d...@milhouse:~/tmp $ cat output/part-00000
> 1453024
> 2592322
> 3875994
> 4689583
> 5340522
> 607354
> 6447778
> 6535495
> 8647464
> 9971681
>
> These are my questions:
> 1. The sorting (per reducer) seems to be working, but I don't know why, for
> example, 607354 is not the first number in the output.
>
> 2. How can I tell Hadoop to send data to the reducers in such a way that
> inputReduce1keys < inputReduce2keys < ... < inputReduceNkeys? That way I
> would ensure the data is fully sorted once the job is done.
>
> I've also tried using the identity classes for the mapper and reducer, but
> the job dies with exceptions about the input format.
> Can anyone show me, or point me to, some code showing how to properly perform
> sorting?
>
> Thanks in advance,
> -drd
>
>
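
The ordering asked for in question 2 amounts to range partitioning: each key goes to a reducer according to the interval it falls in, so that concatenating the part files in order gives a fully sorted result. Below is a minimal sketch of that idea in plain Ruby. It only simulates the routing and does not use any Hadoop API; the cut point and the name `reducer_for` are hypothetical, chosen for illustration.

```ruby
# Hypothetical cut point: reducer 0 takes keys below 5_000_000, reducer 1 the rest.
# In a real job the cut points would have to be picked (for example by sampling
# the input) so that each reducer receives a similar share of the data.
CUT_POINTS = [5_000_000].freeze

# Index of the reducer for a key: the number of cut points that are <= key.
def reducer_for(key, cut_points = CUT_POINTS)
  cut_points.count { |c| key >= c }
end

keys = [9971681, 9686036, 2592322, 4518219, 1467363] # from small_numbers.txt

# Simulate the shuffle: group keys by reducer index, order the groups by that
# index, and sort numerically inside each group.
parts = keys.group_by { |k| reducer_for(k) }.sort.map { |_, ks| ks.sort }

parts.each_with_index do |part, i|
  puts "part-#{format('%05d', i)}: #{part.join(' ')}"
end

# Reading the parts back in order yields the fully sorted data.
puts "concatenated: #{parts.flatten.join(' ')}"
```

With the two reducers used in the job above, a single cut point is enough; N reducers would need N-1 cut points.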
