When using the Hadoop streaming jar, if the reduce job outputs only a value (no 
key), the code incorrectly outputs the value followed by the tab character 
(key/value separator).
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: HADOOP-4913
                 URL: https://issues.apache.org/jira/browse/HADOOP-4913
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/streaming
    Affects Versions: 0.18.2
         Environment: Red Hat Linux 5.
            Reporter: John Fisher
            Priority: Minor
             Fix For: site


I would like the output of my streaming job to contain only the value, omitting 
the key and the key/value separator.  However, when I print only the value, 
each output line ends with a tab character.  I believe I have tracked down the 
cause (described below), but I'm not 100% sure.  The fix is working for me, 
though, so I figured it might be worth incorporating into the code base.

The tab gets printed because of an insufficient check in the TextOutputFormat 
code.  It checks whether the "key" and "value" objects are null.  If both are 
non-null, the line is printed as <key><separator><value>; otherwise only the 
key or the value is printed, depending on which one is defined.  The bug is 
that the key and value are always non-null, even when one of them is empty.  I 
traced further up to see whether these objects were being defined when they 
shouldn't be, but that appears to be the intended behavior.  So I changed the 
Hadoop code to treat an empty string (length zero) the same as a null object.

*** Patch code begin ***

if( ! nullKey ) {
  nullKey = ( key.toString().length() == 0 );
}
if( ! nullValue ) {
  nullValue = ( value.toString().length() == 0 );
}

*** Patch code end ***
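
To make the effect of the patch concrete, here is a minimal standalone sketch 
of the check described above.  It does not use the real Hadoop classes: the 
class name TextLineSketch and the methods formatOriginal, formatPatched and 
join are hypothetical, and the formatting logic is only my reading of how the 
write path behaves, not a copy of the TextOutputFormat source.

```java
// Hypothetical stand-in for the key/value line formatting discussed above.
// Not Hadoop code; it only mimics the null checks described in this report.
class TextLineSketch {
    static final String SEPARATOR = "\t";

    // Check as described in the report: only null counts as "absent".
    static String formatOriginal(Object key, Object value) {
        boolean nullKey = (key == null);
        boolean nullValue = (value == null);
        return join(key, value, nullKey, nullValue);
    }

    // Patched check: an empty string is treated the same as null.
    static String formatPatched(Object key, Object value) {
        boolean nullKey = (key == null);
        boolean nullValue = (value == null);
        if (!nullKey) {
            nullKey = (key.toString().length() == 0);
        }
        if (!nullValue) {
            nullValue = (value.toString().length() == 0);
        }
        return join(key, value, nullKey, nullValue);
    }

    // Emit <key><separator><value>, or just the side that is present.
    private static String join(Object key, Object value,
                               boolean nullKey, boolean nullValue) {
        StringBuilder sb = new StringBuilder();
        if (!nullKey) {
            sb.append(key);
        }
        if (!nullKey && !nullValue) {
            sb.append(SEPARATOR);
        }
        if (!nullValue) {
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make a trailing tab visible in the printed output.
        System.out.println("[" + formatOriginal("only-this", "") + "]");
        System.out.println("[" + formatPatched("only-this", "") + "]");
    }
}
```

With one side empty but non-null, formatOriginal appends the separator and 
formatPatched does not, which is the difference I am seeing in my job output.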

The OutputCollector calls the TextOutputFormat.write() method with whatever 
objects are passed into it (see ReduceTask.java, line 300), so that part is 
fine.

Above that, however, in the run() method of PipeMapRed.java, the code creates 
a new key object and a new value object and then starts reading lines and 
feeding them to the OutputCollector.  This is why the key and value are always 
defined by the time they reach TextOutputFormat.write(), and why we always see 
the tab.
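
As a rough illustration of why neither object is ever null, here is a small 
sketch of a line being split into a key and a value.  The class 
StreamingSplitSketch and the method splitKeyVal are hypothetical names, and the 
rule "no tab means the whole line is one side and the other side is an empty 
string" is my assumption about the streaming behavior, not something copied 
from PipeMapRed.java.

```java
// Hypothetical sketch of splitting one line of reducer output.
// Assumption: a line without a tab yields an empty (but non-null) second part.
class StreamingSplitSketch {

    // Returns a two-element array: {key, value}. Neither element is ever null.
    static String[] splitKeyVal(String line) {
        int pos = line.indexOf('\t');
        if (pos < 0) {
            // No separator found: the other side is empty, not null.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyVal("only-this");
        // A null check alone cannot tell that the second part is absent.
        System.out.println("second part is null:  " + (kv[1] == null));
        System.out.println("second part is empty: " + kv[1].isEmpty());
    }
}
```

Because both parts are always real objects, a check that only tests for null 
will always take the <key><separator><value> branch.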

Thanks,
John

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
