[ https://issues.apache.org/jira/browse/MAPREDUCE-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988299#action_12988299 ]
Todd Lipcon commented on MAPREDUCE-2254:
----------------------------------------
It seems you're being inconsistent here - why is it that LineReader shouldn't
take an arbitrary delimiter but LineRecordReader should? What I mean is that
a "line" is either a sequence of characters terminated by a newline, or a
sequence of characters terminated by an arbitrary delimiter. If "line" means
something ending in a newline, then maybe this new feature should go in a new
class like DelimitedTextInputFormat or something? If a "line" really can be
delimited by anything, then I would support moving this functionality up into
LineReader, with a different constructor. That way the similar code will at
least be next to each other.

It just smells really bad to me to extend a class and then reimplement its only
nontrivial method. Maybe we could alternatively extract an interface here?
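
To make the "different constructor" option concrete, here is a minimal,
self-contained sketch of a reader that keeps the default newline handling and
the custom-delimiter handling side by side. The names DelimitedRecordReader,
readRecord and readAll are hypothetical, nothing here is Hadoop's actual
LineReader API, and it reads one byte at a time for clarity rather than speed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; not Hadoop's LineReader. */
final class DelimitedRecordReader {
    private final PushbackInputStream in;
    private final byte[] delimiter; // null means default '\n', '\r' or '\r\n'

    /** Default behaviour: records end at '\n', '\r' or '\r\n'. */
    DelimitedRecordReader(InputStream in) {
        this(in, null);
    }

    /** Alternate constructor: records end at the given byte sequence. */
    DelimitedRecordReader(InputStream in, byte[] delimiter) {
        if (delimiter != null && delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        this.in = new PushbackInputStream(in, 1);
        this.delimiter = (delimiter == null) ? null : delimiter.clone();
    }

    /** Returns the next record without its delimiter, or null at end of input. */
    String readRecord() throws IOException {
        byte[] buf = new byte[64];
        int len = 0;
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (delimiter == null) {
                if (b == '\n') {
                    break;
                }
                if (b == '\r') {
                    // Treat "\r\n" as one delimiter; push back anything else.
                    int next = in.read();
                    if (next != -1 && next != '\n') {
                        in.unread(next);
                    }
                    break;
                }
            }
            if (len == buf.length) {
                buf = Arrays.copyOf(buf, len * 2);
            }
            buf[len++] = (byte) b;
            // Custom delimiter: check whether the bytes read so far end with it.
            if (delimiter != null && endsWith(buf, len, delimiter)) {
                len -= delimiter.length;
                break;
            }
        }
        if (!sawAny) {
            return null;
        }
        return new String(buf, 0, len, StandardCharsets.UTF_8);
    }

    private static boolean endsWith(byte[] buf, int len, byte[] suffix) {
        if (len < suffix.length) {
            return false;
        }
        for (int i = 0; i < suffix.length; i++) {
            if (buf[len - suffix.length + i] != suffix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Convenience helper: splits a string into records; null delimiter means newlines. */
    static List<String> readAll(String input, String delimiter) {
        byte[] delim = (delimiter == null) ? null : delimiter.getBytes(StandardCharsets.UTF_8);
        DelimitedRecordReader reader = new DelimitedRecordReader(
                new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)), delim);
        List<String> records = new ArrayList<>();
        try {
            String record;
            while ((record = reader.readRecord()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

With a layout like this, a caller that passes no delimiter gets the existing
newline behaviour, and a record containing embedded newlines can be read
whole by passing a delimiter such as "||".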
> Allow setting of end-of-record delimiter for TextInputFormat
> ------------------------------------------------------------
>
> Key: MAPREDUCE-2254
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2254
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Ahmed Radwan
> Attachments: MAPREDUCE-2245.patch
>
>
> It will be useful to allow setting the end-of-record delimiter for
> TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as
> the only possible record delimiters. This is a problem if users have embedded
> newlines in their data fields (which is pretty common). This is also a
> problem for other tools using this TextInputFormat (See for example:
> https://issues.apache.org/jira/browse/PIG-836 and
> https://issues.cloudera.org/browse/SQOOP-136).
> I have written a patch to address this issue. This patch allows users to
> specify any custom end-of-record delimiter using a newly added configuration
> property. For backward compatibility, if this new configuration property is
> absent, then exactly the same delimiters as before are used (i.e., '\n', '\r'
> or '\r\n').