sort example

David Rio Sat, 16 May 2009 21:11:26 -0700

Hi,
I am trying to sort some data with Hadoop in streaming mode. The input looks
like:
 $ cat small_numbers.txt
9971681
9686036
2592322
4518219
1467363


To send my job to the cluster I use:
hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
-D "mapred.reduce.tasks=2" \
-D "stream.num.map.output.key.fields=1" \
-D mapred.text.key.comparator.options=-k1,1n \
-input /input \
-output /output \
-mapper sort_mapper.rb \
-file `pwd`/scripts_sort/sort_mapper.rb \
-reducer sort_reducer.rb \
-file `pwd`/scripts_sort/sort_reducer.rb

The mapper code basically writes key, value = input_line, input_line.
The reducer just prints the keys from the standard input.
In case you care:
 $ cat scripts_sort/sort_*
#!/usr/bin/ruby

STDIN.each_line {|l| puts "#{l.chomp}\t#{l.chomp}"}
---------------------------------------------------------------------
#!/usr/bin/ruby

STDIN.each_line { |line| puts line.split[0] }
I ran the job and it completed without problems. The output looks like:
d...@milhouse:~/tmp $ cat output/part-00001
1380664
1467363
32485
3857847
422538
4354952
4518219
5719091
7838358
9686036
d...@milhouse:~/tmp $ cat output/part-00000
1453024
2592322
3875994
4689583
5340522
607354
6447778
6535495
8647464
9971681
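
A quick check in plain Ruby, outside Hadoop, shows that part-00000 is in text order rather than numeric order. The constant and method names below are my own for this check; nothing here is Hadoop API. My guess is that the -k1,1n option is not being applied, but I have not confirmed that.

```ruby
#!/usr/bin/ruby

# Keys from part-00000, in the order the reducer wrote them.
PART_00000 = %w[1453024 2592322 3875994 4689583 5340522
                607354 6447778 6535495 8647464 9971681]

# Sort the keys as text (character by character).
def text_sorted(keys)
  keys.sort
end

# Sort the keys by integer value, which is what I expected from -k1,1n.
def numeric_sorted(keys)
  keys.sort_by { |k| k.to_i }
end

puts "text order matches file:    #{text_sorted(PART_00000) == PART_00000}"
puts "numeric order matches file: #{numeric_sorted(PART_00000) == PART_00000}"
puts "first key numerically:      #{numeric_sorted(PART_00000).first}"
```

The text sort reproduces the file exactly, while the numeric sort puts 607354 first.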
These are my questions:
1. The sorting within each reducer seems to be working, but I don't know why,
for example, 607354 is not the first number in part-00000.

2. How can I tell Hadoop to send data to the reducers in such a way that
inputReduce1keys < inputReduce2keys < ... < inputReduceNkeys? That way the
data would be fully sorted once the job is done.
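
To make question 2 concrete, here is the behaviour I am after, sketched in plain Ruby outside Hadoop. The names CUT_POINTS, partition_for and simulate are made up for illustration, and the cut point of 5,000,000 is arbitrary; this is not how Hadoop partitions keys, only what I would like the result to look like.

```ruby
#!/usr/bin/ruby

# Hypothetical cut points: reducer 0 gets keys below 5_000_000,
# reducer 1 gets everything else.
CUT_POINTS = [5_000_000]

# Index of the reducer that should receive this key.
def partition_for(key, cut_points = CUT_POINTS)
  idx = cut_points.index { |c| key.to_i < c }
  idx.nil? ? cut_points.size : idx
end

# The five keys from small_numbers.txt.
SAMPLE_KEYS = %w[9971681 9686036 2592322 4518219 1467363]

# Route each key to its reducer, then sort numerically inside each reducer.
def simulate(keys, cut_points = CUT_POINTS)
  buckets = Array.new(cut_points.size + 1) { [] }
  keys.each { |k| buckets[partition_for(k, cut_points)] << k }
  buckets.map { |ks| ks.sort_by { |k| k.to_i } }
end

simulate(SAMPLE_KEYS).each_with_index do |ks, i|
  puts "reducer #{i}: #{ks.join(' ')}"
end
```

With this routing, concatenating the reducer outputs in order (part-00000, then part-00001, and so on) would give one fully sorted list.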
I've also tried using the identity classes for the mapper and reducer, but
the job dies with exceptions about the input format.
Can anyone show me, or point me to, some code showing how to properly perform
sorting?
Thanks in advance,
-drd
