[ https://issues.apache.org/jira/browse/MAPREDUCE-6000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072976#comment-14072976 ]
Todd Lipcon commented on MAPREDUCE-6000: ---------------------------------------- I chatted offline with Sean a bit about this one, and we realized there are a couple issues: - he made the point that, even if we don't internally use all of the methods, users might expect a DataOutput/DataInput implementation to be fully functional. - I hadn't realized that DataInput/DataOutput use "modified UTF-8", and not actually UTF-8. I added a new test case with a unicode cat face and found that my new implementation was incompatible with Java's implementation, since I was using "real UTF8" and not "modified". But, we still need to do something like this patch, since the current code actually appears to copy-paste stuff from the Oracle JDK, and thus isn't Apache-license-compatible. So, I'll upload a new rev of the patch with the following changes since the previous one: - add a test case with unicode cat face (it's outside the "basic multilingual plane" so tends to expose bugs like the above) - instead of trying to implement the more complex methods, the reader/writer classes now have an internal java.io.Data{Input,Output}Stream instance. We delegate to these for any of the more substantial methods, so that we don't have to duplicate any encoding/decoding code. This is safe since those classes don't do any of their own buffering. > native-task: Simplify ByteBufferDataReader/Writer > ------------------------------------------------- > > Key: MAPREDUCE-6000 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6000 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: task > Reporter: Todd Lipcon > Assignee: Todd Lipcon > Priority: Minor > Attachments: mapreduce-6000.txt, mapreduce-6000.txt > > > The ByteBufferDataReader and ByteBufferDataWriter class are more complex than > necessary: > - several methods related to reading/writing strings and char arrays are > implemented but never used by the native task code. Given that the use case > for these classes is limited to serializing binary data to/from the native > code, it seems unlikely people will want to use these methods in any > performance-critical space. So, let's do simpler implementations that are > less likely to be buggy, even if they're slightly less performant. > - methods like readLine() are even less likely to be used. Since it's a > complex implementation, let's just throw UnsupportedOperationException > - in the test case, we can use Mockito to shorten the amount of new code -- This message was sent by Atlassian JIRA (v6.2#6252)