[ 
https://issues.apache.org/jira/browse/HADOOP-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465134
 ] 

Sanjay Dahiya commented on HADOOP-788:
--------------------------------------

A patch is attached for review.

> Streaming should use a subclass of TextInputFormat for reading text inputs.
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-788
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Owen O'Malley
>         Assigned To: Sanjay Dahiya
>         Attachments: Hadoop-788.patch
>
>
> Currently streaming uses a lot of custom code for processing text inputs. 
> I propose:
>  1. Move class LineRecordReader out of TextInputFormat.
>  2. Make class StreamLineRecordReader extend LineRecordReader.
>  3. StreamLineRecordReader uses LineRecordReader.next to read the lines and 
> splits them on tab to generate a Text/Text key/value pair.
> This will remove a lot of code from streaming and give it automatic support 
> for the compression codecs that the "base" part of Hadoop enjoys. In 
> particular, if the native zlib code is used, it will remove the 2 GB limit on 
> compressed files.
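
For reference while reviewing, step 3 of the proposal (split each line on tab
into a key/value pair) could be sketched roughly as below. This is not taken
from the attached patch: the class and method names are hypothetical, plain
String stands in for Hadoop's Text so the sketch runs without Hadoop on the
classpath, and splitting on the first tab only (with an empty value when no
tab is present) is an assumption about the intended behaviour.

```java
// Hypothetical illustration only; not the code in Hadoop-788.patch.
final class TabSplitSketch {

    // Splits a line into {key, value} at the first tab character.
    // Assumption: a line without a tab becomes the key, with an empty value.
    static String[] splitOnFirstTab(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        // Everything before the first tab is the key; the rest,
        // including any later tabs, is the value.
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitOnFirstTab("user42\tclicked\thome");
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

In the proposed design, logic of this kind would sit in
StreamLineRecordReader and run on each line returned by
LineRecordReader.next, so decompression stays entirely in the base class.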

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira