bbeaudreault commented on PR #4215:
URL: https://github.com/apache/hadoop/pull/4215#issuecomment-1430703431

   Just want to clarify the use-case here (as a reminder and for the new 
reviewers):
   
   - When you write a SequenceFile, the key class and value class are encoded 
in the format.
   - When you go to read that file later, the SequenceFile.Reader gets those 
key/value class names from the headers and tries to load the class so it can 
parse the keys/values. This is where WritableName first comes in, with the 
default being `Class.forName(keyClass)`.
   - However, some SequenceFiles may be very old and the classes may have been 
renamed. In this case, the above would throw ClassNotFoundException.
   - This is why `WritableName.addName` exists, which allows you to specify 
aliases pointing those old class names at whatever the new name is. When 
`WritableName.getClass` is called it will check to see if an alias was 
registered by calling addName prior to opening the SequenceFile. If so, it 
returns that class.
   
   This all worked when all key/value extend Writable, but Hadoop also supports 
Serialization framework. You can specify `io.serializations` to register 
serializations, and the SerializationFactory will try finding a serializer for 
the key or value class. Serializations have an `boolean accept(Class)` method, 
and one of the registered serializations need to return true for that.
   
   So when the same "old sequence file contains a key or value class that has 
been renamed" problem happens, if you are using Serialization you are out of 
luck. By default you'd get a ClassNotFoundException, and if you tried doing 
WritableName.addName, you'd get ClassCastException.
   
   ----
   
   The simplest fix for that seemed to be in WritableName which appeared to be 
IA.Private and have no real usages in the repo outside of SequenceFile. A one 
line change there was attractive. The risk here seems pretty low, at least for 
how SequenceFile uses this class.
   
   If we have concerns here, there are other possible more involved solutions 
we could discuss. For example, we could add something in SerializationFactory 
to add aliases. This would be more involved though because it'd require a 
slight refactor in SequenceFile and we'd have to make sure that new API worked 
for any other usages of SerializationFactory.
   
   That's why I chose the simple 1 liner approach, since it solves the problem 
with simplicity and minimal external impact.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to