When using the Hadoop streaming jar, if the reduce job outputs only a value (no
key), the code incorrectly outputs the value along with the tab character (the
key/value separator).
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Key: HADOOP-4913
URL: https://issues.apache.org/jira/browse/HADOOP-4913
Project: Hadoop Core
Issue Type: Bug
Components: contrib/streaming
Affects Versions: 0.18.2
Environment: Red Hat Linux 5.
Reporter: John Fisher
Priority: Minor
Fix For: site
I would like the output of my streaming job to contain only the value, omitting
the key and the key/value separator. However, when I print only the value, each
output line ends with a tab character. I believe I have tracked down the cause
(described below), but I'm not 100% sure. The fix works for me, though, so I
figured it might be worth incorporating into the code base.
The tab is printed because of an insufficient check in the TextOutputFormat
code. It checks whether the "key" and "value" objects are null. If neither is
null, the line is printed as <key><separator><value>; otherwise only the key or
the value is printed, whichever is defined. The bug is that the key and value
are always defined (non-null). I traced further up to see whether these objects
were being defined when they shouldn't be, but that appears to be how it is
supposed to work. I changed the Hadoop code so that it treats a zero-length
string, as well as a null object, as missing.
*** Patch code begin ***
// Treat an empty key as missing, not just a null one.
if (!nullKey) {
  nullKey = (key.toString().length() == 0);
}
// Likewise, treat an empty value as missing.
if (!nullValue) {
  nullValue = (value.toString().length() == 0);
}
*** Patch code end ***
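To make the effect of the patch concrete, here is a minimal standalone sketch of the check described above. It is not the Hadoop source: the class SeparatorSketch, the method formatLine, and the treatEmptyAsNull flag are hypothetical names invented for this illustration, and plain Java objects stand in for Hadoop's key and value types.

```java
// Hypothetical standalone model of the null check described in this report.
// It does not use or reproduce the real TextOutputFormat class.
final class SeparatorSketch {

    // Builds one output line. With treatEmptyAsNull == false this models the
    // reported behaviour; with true it models the proposed patch.
    static String formatLine(Object key, Object value, String separator,
                             boolean treatEmptyAsNull) {
        boolean nullKey = (key == null);
        boolean nullValue = (value == null);

        if (treatEmptyAsNull) {
            // The patch: an empty string counts as missing too.
            if (!nullKey) {
                nullKey = (key.toString().length() == 0);
            }
            if (!nullValue) {
                nullValue = (value.toString().length() == 0);
            }
        }

        if (nullKey && nullValue) {
            return ""; // nothing to write
        }

        StringBuilder line = new StringBuilder();
        if (!nullKey) {
            line.append(key);
        }
        // The separator is written only when both sides are present.
        if (!nullKey && !nullValue) {
            line.append(separator);
        }
        if (!nullValue) {
            line.append(value);
        }
        line.append('\n');
        return line.toString();
    }

    public static void main(String[] args) {
        // One side is a non-null but empty string, as in the report.
        String before = formatLine("hello", "", "\t", false);
        String after = formatLine("hello", "", "\t", true);
        System.out.println(before.endsWith("\t\n")); // true: trailing tab
        System.out.println(after.equals("hello\n")); // true: tab is gone
    }
}
```

With the flag off, a non-null but empty side still triggers the separator, which matches the trailing tab I am seeing. With the flag on, the line contains only the non-empty side.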
The OutputCollector calls the TextOutputFormat.write() method with whatever
objects are passed into it (see ReduceTask.java, line 300), so that part is
fine. But one level above that, in the run() method of PipeMapRed.java, you
will see that the code creates a new key object and a new value object and
then starts reading lines and feeding them to the OutputCollector. This is why
the key and value are always defined by the time they reach
TextOutputFormat.write(), and why we always see the tab.
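The situation can be sketched roughly as follows. This is an illustration only, not the PipeMapRed code: the class LineSplitSketch and the method splitKeyValue are hypothetical, and the sketch assumes that a line without a separator is taken whole as the key, with an empty string as the value.

```java
// Hypothetical model of turning one line of reducer output into a key and a
// value. Both results are always non-null, which is the point of the sketch.
final class LineSplitSketch {

    // Returns a two-element array {key, value}; neither element is ever null.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // Assumption: no separator means the whole line is the key and
            // the value is an empty (but non-null) string.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("only-a-value", '\t');
        // prints: key=[only-a-value] value=[]
        System.out.println("key=[" + kv[0] + "] value=[" + kv[1] + "]");
    }
}
```

Because both objects always exist, a null-only check downstream cannot tell "no value" apart from "empty value", so the separator is always written.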
Thanks,
John