Streaming should use a subclass of TextInputFormat for reading text inputs.
---------------------------------------------------------------------------

                 Key: HADOOP-788
                 URL: http://issues.apache.org/jira/browse/HADOOP-788
             Project: Hadoop
          Issue Type: Improvement
          Components: contrib/streaming
            Reporter: Owen O'Malley
         Assigned To: Sanjay Dahiya


Currently streaming uses a lot of custom code for processing text inputs. 

I propose:

 1. Move class LineRecordReader out of TextInputFormat.
 2. Make class StreamLineRecordReader extend LineRecordReader.
 3. Have StreamLineRecordReader use LineRecordReader.next to read each line and 
split it on tab to generate a Text/Text key/value pair.
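
As a rough illustration of step 3, the tab split could look something like the
sketch below. It is plain Java with no Hadoop dependency, so it uses String
where the real reader would use Text. The names TabSplitSketch and splitOnTab
are hypothetical. Splitting at the first tab, and treating a line without a
tab as a key with an empty value, are both assumptions; the proposal does not
specify either.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: shows one way to turn a line of
// text into a key/value pair by splitting on a tab character.
final class TabSplitSketch {

    private TabSplitSketch() {
        // static helper only
    }

    // Splits the line at the first tab (assumption). Everything before the
    // tab is the key, everything after it is the value.
    static Map.Entry<String, String> splitOnTab(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // Assumption: with no tab, the whole line is the key and the
            // value is empty.
            return new SimpleImmutableEntry<>(line, "");
        }
        return new SimpleImmutableEntry<>(
                line.substring(0, tab), line.substring(tab + 1));
    }
}
```

In the proposed design the line itself would come from LineRecordReader.next,
so a helper like this would only have to do the split.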

This will remove a lot of code from streaming and give it automatic support for 
the compression codecs that the "base" part of Hadoop enjoys. In particular, if 
the native zlib code is used, this will remove the 2 GB limit on compressed files.
