[ https://issues.apache.org/jira/browse/HADOOP-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465134 ]
Sanjay Dahiya commented on HADOOP-788:
--------------------------------------

Patch is attached for review.

> Streaming should use a subclass of TextInputFormat for reading text inputs.
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-788
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Owen O'Malley
>         Assigned To: Sanjay Dahiya
>         Attachments: Hadoop-788.patch
>
>
> Currently streaming uses a lot of custom code for processing text inputs.
> I propose:
>   1. Move class LineRecordReader out of TextInputFormat.
>   2. Make class StreamLineRecordReader extend LineRecordReader.
>   3. StreamLineRecordReader uses LineRecordReader.next to read the lines and
> splits them on tab to generate a Text/Text key/value pair.
> This will remove a lot of code from streaming and give it automatic support
> for the compression codecs that the "base" part of Hadoop enjoys. In
> particular, if the native zlib code is used, it will remove the 2 GB limit on
> compressed files.
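
For readers following the proposal, here is a rough sketch of the subclassing idea in plain Java. It is not the attached patch and does not use the Hadoop API: SketchLineReader and SketchStreamLineReader are hypothetical stand-ins for LineRecordReader and StreamLineRecordReader, and String pairs stand in for Text/Text. The handling of a line with no tab (whole line as key, empty value) is an assumption, since the issue text does not specify it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for LineRecordReader: it owns all line reading.
class SketchLineReader {
    private final BufferedReader in;

    SketchLineReader(BufferedReader in) {
        this.in = in;
    }

    // Returns the next line without its terminator, or null at end of input.
    String nextLine() throws IOException {
        return in.readLine();
    }
}

// Hypothetical stand-in for StreamLineRecordReader: it reuses the base
// reader for lines and only adds the split on tab.
class SketchStreamLineReader extends SketchLineReader {

    SketchStreamLineReader(BufferedReader in) {
        super(in);
    }

    // Returns the next key/value pair, or null at end of input.
    Map.Entry<String, String> next() throws IOException {
        String line = nextLine();
        return line == null ? null : splitOnFirstTab(line);
    }

    // Splits on the first tab only; later tabs stay in the value.
    // Assumption: a line with no tab becomes the key with an empty value.
    static Map.Entry<String, String> splitOnFirstTab(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    // Convenience helper for the demo: reads every record from a string.
    static List<Map.Entry<String, String>> readAll(String text) {
        SketchStreamLineReader reader =
                new SketchStreamLineReader(new BufferedReader(new StringReader(text)));
        List<Map.Entry<String, String>> records = new ArrayList<>();
        try {
            Map.Entry<String, String> record;
            while ((record = reader.next()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> r : readAll("k1\tv1\nnotab\nk2\ta\tb\n")) {
            System.out.println("key=[" + r.getKey() + "] value=[" + r.getValue() + "]");
        }
    }
}
```

Because the subclass never touches the underlying stream itself, anything the base reader learns to do (for example, decompressing its input) would carry over without changes to the streaming side, which is the benefit the issue description is after.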