Hi,
I was reading through the Streaming documentation
(http://hadoop.apache.org/core/docs/r0.15.3/streaming.html), and the
KeyFieldBasedPartitioner example might need some fixing.
First, I got errors about Text vs. LongWritable because of the
IdentityMapper/Reducer, so I changed both of those to "cat".
Next, I believe the partitioner class uses a regex to split the key, so the
"map.output.key.field.separator" value needs to be escaped as "\.".
The command that worked for me:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper cat \
-reducer cat \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-jobconf stream.map.output.field.separator=. \
-jobconf stream.num.map.output.key.fields=4 \
-jobconf map.output.key.field.separator="\." \
-jobconf num.key.fields.for.partition=2 \
-jobconf mapred.reduce.tasks=12
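
To illustrate why the escaping seems to matter, here is a minimal Java sketch.
It assumes the partitioner passes the separator straight to String.split, which
interprets its argument as a regex; I have not confirmed this against the
Hadoop source. The class name SeparatorRegexDemo, the helper splitKey, and the
sample key are made up for illustration.

```java
public class SeparatorRegexDemo {
    // Assumption: the partitioner splits the key roughly like this.
    static String[] splitKey(String key, String separator) {
        // String.split treats the separator as a regular expression.
        return key.split(separator);
    }

    public static void main(String[] args) {
        String key = "11.12.1.2"; // hypothetical key with four dot-separated fields

        // An unescaped "." matches every character, so only empty strings
        // remain, and split drops trailing empty strings.
        System.out.println(splitKey(key, ".").length);   // prints 0

        // An escaped "\." matches only a literal dot.
        System.out.println(splitKey(key, "\\.").length); // prints 4
    }
}
```

If that assumption holds, an unescaped "." would leave the partitioner with no
fields to hash on, which matches what I saw.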
Ideally, though, I think the partitioner should be fixed to treat the separator
as a literal string (using indexOf or some such) rather than as a regex.
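
As a rough idea of what I mean, here is a sketch of an indexOf-based split that
treats the separator literally. This is not the partitioner's actual code; the
class LiteralSplitDemo and the method splitLiteral are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class LiteralSplitDemo {
    // Splits key on a literal separator using indexOf; no regex is involved,
    // so characters such as "." or "|" need no escaping.
    static List<String> splitLiteral(String key, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> fields = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = key.indexOf(separator, start)) >= 0) {
            fields.add(key.substring(start, hit));
            start = hit + separator.length();
        }
        fields.add(key.substring(start)); // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitLiteral("11.12.1.2", ".")); // prints [11, 12, 1, 2]
    }
}
```

With something like this, the unescaped "." from the documentation example
would work as written.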
Of course I'm relatively new to Hadoop (which is why I'm reading the
documentation!), and might just be misunderstanding something here.
Richendra