bbeaudreault commented on PR #4215: URL: https://github.com/apache/hadoop/pull/4215#issuecomment-1430703431
Just want to clarify the use-case here (as a reminder and for the new reviewers): - When you write a SequenceFile, the key class and value class are encoded in the format. - When you go to read that file later, the SequenceFile.Reader gets those key/value class names from the headers and tries to load the class so it can parse the keys/values. This is where WritableName first comes in, with the default being `Class.forName(keyClass)`. - However, some SequenceFiles may be very old and the classes may have been renamed. In this case, the above would throw ClassNotFoundException. - This is why `WritableName.addName` exists, which allows you to specify aliases pointing those old class names at whatever the new name is. When `WritableName.getClass` is called it will check to see if an alias was registered by calling addName prior to opening the SequenceFile. If so, it returns that class. This all worked when all key/value extend Writable, but Hadoop also supports Serialization framework. You can specify `io.serializations` to register serializations, and the SerializationFactory will try finding a serializer for the key or value class. Serializations have an `boolean accept(Class)` method, and one of the registered serializations need to return true for that. So when the same "old sequence file contains a key or value class that has been renamed" problem happens, if you are using Serialization you are out of luck. By default you'd get a ClassNotFoundException, and if you tried doing WritableName.addName, you'd get ClassCastException. ---- The simplest fix for that seemed to be in WritableName which appeared to be IA.Private and have no real usages in the repo outside of SequenceFile. A one line change there was attractive. The risk here seems pretty low, at least for how SequenceFile uses this class. If we have concerns here, there are other possible more involved solutions we could discuss. For example, we could add something in SerializationFactory to add aliases. This would be more involved though because it'd require a slight refactor in SequenceFile and we'd have to make sure that new API worked for any other usages of SerializationFactory. That's why I chose the simple 1 liner approach, since it solves the problem with simplicity and minimal external impact. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org