TextInputFormat does not correctly handle all line endings
----------------------------------------------------------
Key: HADOOP-473
URL: http://issues.apache.org/jira/browse/HADOOP-473
Project: Hadoop
Issue Type: Bug
Components: mapred
Affects Versions: 0.5.0, 0.6.0
Environment: All environments
Reporter: Dennis Kubes
Attachments: text-input-format.patch
The current TextInputFormat readLine method breaks on any single '\r' or '\n'
character. For Windows-formatted text files, whose lines end in "\r\n", this
leaves the trailing '\n' unread, so the next call to readLine on the same
input stream returns an empty string. The attached patch corrects this by
recognizing both single- and double-character line endings and positioning
the input stream at the start of the next line. It handles Windows ("\r\n"),
Mac ('\r'), and Unix ('\n') line endings.
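
A minimal sketch of the behaviour described above, for illustration only. It
is not the attached patch, whose contents are not reproduced in this message.
The class name LineEndingSketch and the helper readAll are hypothetical, and
the one-byte lookahead via java.io.PushbackInputStream is an assumption about
one way to consume a "\r\n" pair as a single terminator.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the code in text-input-format.patch.
public class LineEndingSketch {

    /**
     * Reads one line and consumes its terminator: "\n", "\r", or "\r\n".
     * Returns null only when the stream is already at end of input.
     */
    static String readLine(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null; // nothing left to read
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (b != -1) {
            if (b == '\n') {
                break; // Unix line ending
            }
            if (b == '\r') {
                // Peek one byte: swallow a following '\n' (Windows),
                // otherwise push the byte back (old Mac line ending).
                int next = in.read();
                if (next != '\n' && next != -1) {
                    in.unread(next);
                }
                break;
            }
            buf.write(b);
            b = in.read();
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Hypothetical helper: splits a whole string using readLine above. */
    static List<String> readAll(String text) throws IOException {
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = readLine(in)) != null) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Mixed Windows, Mac, and Unix endings in one input.
        System.out.println(readAll("a\r\nb\rc\nd"));
    }
}
```

With this approach a Windows-formatted file no longer yields a spurious empty
line after each real line, while a genuinely blank line (two terminators in a
row) is still returned as an empty string.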