Order of Operations

Premal Shah Fri, 05 Aug 2011 11:00:47 -0700

Hi,
According to the attached image, found in Yahoo's Hadoop tutorial
<http://developer.yahoo.com/hadoop/tutorial/module4.html>,
the order of operations is map > combine > partition, which should be
followed by reduce.


Here is an example key emitted by the map operation:
LongValueSum:geo_US|1311722400|E     1

This should get combined with the other records that share the same key into
geo_US|1311722400|E     100
(assuming there are 100 records with this key)
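The combine step described above can be sketched in plain Python. This is only a minimal illustration of the arithmetic, not Hadoop's implementation: the function name combine_long_value_sum is hypothetical, and it assumes the key and the count are separated by a tab.

```python
from collections import defaultdict


def combine_long_value_sum(records):
    """Sum the counts of records that share the same LongValueSum key.

    Hypothetical helper: assumes each record looks like
    "LongValueSum:<key>\t<count>" with a tab before the count.
    """
    totals = defaultdict(int)
    for line in records:
        key, value = line.split("\t", 1)
        # Separate the aggregator type from the actual key.
        agg_type, _, key_id = key.partition(":")
        if agg_type != "LongValueSum":
            raise ValueError("unsupported aggregator type: " + agg_type)
        totals[key_id] += int(value)
    return dict(totals)


if __name__ == "__main__":
    # 100 identical map outputs collapse into one record with the value 100.
    records = ["LongValueSum:geo_US|1311722400|E\t1"] * 100
    print(combine_long_value_sum(records))
```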

Then I'd like to partition the keys by the value before the first pipe (|)
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29
geo_US
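The intended partitioning can be sketched as follows. This is a rough illustration under stated assumptions: partition_for is a hypothetical name, and zlib.crc32 is only a stand-in hash, since KeyFieldBasedPartitioner uses its own hash function. The point is just that every key with the same first field lands in the same partition.

```python
import zlib


def partition_for(key, num_reducers=8, separator="|"):
    """Pick a reducer index from the first separator-delimited field of key.

    Hypothetical helper: crc32 is a stand-in hash chosen because it is
    stable across runs; Hadoop's partitioner hashes differently.
    """
    prefix = key.split(separator, 1)[0]
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    # Keys that share the prefix "geo_US" map to the same partition.
    for key in ("geo_US|1311722400|E", "geo_US|1311726000|C"):
        print(key, "->", partition_for(key))
```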

So here's my streaming command:

hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-0.20.203.0.jar \
    -D mapred.reduce.tasks=8 \
    -D stream.num.map.output.key.fields=1 \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -D stream.map.output.field.separator=\| \
    -file mapper.py \
    -mapper mapper.py \
    -file reducer.py \
    -reducer reducer.py \
    -combiner org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input input_file \
    -output output_path


This is the error I get:

java.lang.NumberFormatException: For input string: "1311722400|E        1"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
        at org.apache.hadoop.mapred.lib.aggregate.LongValueSum.addNextValue(LongValueSum.java:48)
        at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer.reduce(ValueAggregatorReducer.java:59)
        at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer.reduce(ValueAggregatorReducer.java:35)
        at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1349)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1435)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1297)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:253)
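
For reference, the input string in the trace is exactly what remains of the example map output line after cutting it at the first pipe. The sketch below reproduces that string. It rests on an assumption that should be checked against the streaming documentation: that streaming splits each map output line into key and value at the Nth occurrence of stream.map.output.field.separator, where N is stream.num.map.output.key.fields. The function name split_key_value is hypothetical, and a tab is assumed between the key and the count.

```python
def split_key_value(line, separator="|", num_key_fields=1):
    """Split a map output line at the num_key_fields-th separator.

    Hypothetical model of the key/value split; if the line has fewer
    separators than requested, the whole line becomes the key.
    """
    pos = -1
    for _ in range(num_key_fields):
        pos = line.find(separator, pos + 1)
        if pos == -1:
            return line, ""
    return line[:pos], line[pos + len(separator):]


if __name__ == "__main__":
    key, value = split_key_value("LongValueSum:geo_US|1311722400|E\t1")
    print(repr(key))    # the part before the first pipe
    print(repr(value))  # everything after it, including the count
    try:
        int(value)      # a sum aggregator needs a plain integer here
    except ValueError as exc:
        print("value is not a number:", exc)
```

Under that assumption, the value handed to the combiner would be the rest of the line rather than the count alone, which would produce this exception whether or not the partitioner had already run.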

I think it's because the partitioner is running before the combiner.
Any thoughts?

-- 
Regards,
Premal Shah.
