[ https://issues.apache.org/jira/browse/HADOOP-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532292 ]

Milind Bhandarkar commented on HADOOP-1883:
-------------------------------------------

Vivek,

Can you please generate the patch from within the top-level hadoop directory? I 
cannot apply this patch to trunk, because the filenames all start with 
C:\eclipse... etc.



> Adding versioning to Record I/O
> -------------------------------
>
>                 Key: HADOOP-1883
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1883
>             Project: Hadoop
>          Issue Type: New Feature
>            Reporter: Vivek Ratan
>            Assignee: Vivek Ratan
>         Attachments: 1883_patch01, example.zip
>
>
> There is a need to add versioning support to Record I/O. Users frequently 
> update DDL files, usually by adding/removing fields, but do not want to 
> change the name of the data structure. They would like older & newer 
> deserializers to read as much data as possible. For example, suppose Record 
> I/O is used to serialize/deserialize log records, each of which contains a 
> message and a timestamp. An initial data definition could be as follows:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
> }
> {code}
> Record I/O creates a class, _MyLogRecord_, which represents a log record and 
> can serialize/deserialize itself. Now, suppose newer log records additionally 
> contain a severity level. A user would want to update the definition for a 
> log record but use the same class name. The new definition would be:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
>   int severity;
> }
> {code}
> Users would want a new deserializer to read old log records (and perhaps use 
> a default value for the severity field), and an old deserializer to read 
> newer log records (and skip the severity field).
> This requires some concept of versioning in Record I/O, or rather, the 
> additional ability to read/write type information of a record. The following 
> is a proposal to do this. 
> Every Record I/O Record will have type information which represents how the 
> record is structured (what fields it has, what types, etc.). This type 
> information, represented by the class _RecordTypeInfo_, is itself 
> serializable/deserializable. Every Record supports a method 
> _getRecordTypeInfo()_, which returns a _RecordTypeInfo_ object. Users are 
> expected to serialize this type information (by calling 
> _RecordTypeInfo.serialize()_) in an appropriate fashion (in a separate file, 
> for example, or at the beginning of a file). Using the same DDL as above, 
> here's how we could serialize log records: 
> {code}
> FileOutputStream fOut = new FileOutputStream("data.log");
> CsvRecordOutput csvOut = new CsvRecordOutput(fOut);
> ...
> // get the type information for MyLogRecord
> RecordTypeInfo typeInfo = MyLogRecord.getRecordTypeInfo();
> // ask it to write itself out
> typeInfo.serialize(csvOut);
> ...
> // now, serialize a bunch of records
> while (...) {
>   MyLogRecord log = new MyLogRecord();
>   // fill up the MyLogRecord object
>   ...
>   // serialize
>   log.serialize(csvOut);
> }
> {code}
> In this example, the type information of a Record is serialized first, 
> followed by the contents of various records, all into the same file. 
> Every Record also supports a method that allows a user to set a filter for 
> deserializing. A method _setRTIFilter()_ takes a _RecordTypeInfo_ object as a 
> parameter. This filter represents the type information of the data that is 
> being deserialized. When deserializing, the Record uses this filter (if one 
> is set) to figure out what to read. Continuing with our example, here's how 
> we could deserialize records:
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> // we know the record was written in CSV format
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> ...
> // we know the type info is written at the beginning; read it 
> RecordTypeInfo typeInfoFilter = new RecordTypeInfo();
> // deserialize it
> typeInfoFilter.deserialize(csvIn);
> // let MyLogRecord know what to expect
> MyLogRecord.setRTIFilter(typeInfoFilter);
> // deserialize each record
> while (there is data in file) {
>   MyLogRecord log = new MyLogRecord();
>   log.deserialize(csvIn);
>   ...
> }
> {code}
> The filter is optional. If not provided, the deserializer expects data to be 
> in the same format as it would serialize. (Note that a filter can also be 
> provided for serializing, forcing the serializer to write information in the 
> format of the filter, but there is no use case for this functionality yet). 
> What goes in the type information for a record? The type information for each 
> field in a Record is made up of:
>    1. a unique field ID, which is the field name. 
>    2. a type ID, which denotes the type of the field (int, string, map, etc.). 
> The type information for a composite type contains the type information for 
> each of its fields. This approach is somewhat similar to the one taken by 
> [Facebook's Thrift|http://developers.facebook.com/thrift/], as well as by 
> Google's Sawzall. The main difference is that we use field names as the field 
> ID, whereas Thrift and Sawzall use user-defined field numbers. While field 
> names take more space, they have the big advantage that no changes to existing 
> DDLs are needed. 
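> To make this concrete, here is a minimal sketch of what the type information 
> classes could look like. This is illustrative only: apart from 
> _RecordTypeInfo_, all names here (_FieldTypeInfo_, _TypeID_, etc.) are 
> placeholders, not a final API.
> {code}
> import java.util.ArrayList;
> import java.util.List;
> 
> // Placeholder type IDs for the types Record I/O supports.
> enum TypeID { BOOL, BYTE, INT, LONG, FLOAT, DOUBLE, USTRING, BUFFER, VECTOR, MAP, STRUCT }
> 
> // Type information for a single field: a {field name, field type} tuple.
> class FieldTypeInfo {
>   final String fieldName;      // the unique field ID (the field's name)
>   final TypeID typeID;         // the kind of the field (int, string, map, ...)
>   final RecordTypeInfo nested; // non-null only for STRUCT fields
> 
>   FieldTypeInfo(String fieldName, TypeID typeID, RecordTypeInfo nested) {
>     this.fieldName = fieldName;
>     this.typeID = typeID;
>     this.nested = nested;
>   }
> }
> 
> // Type information for a whole record: its name plus its fields, in order.
> // serialize(RecordOutput)/deserialize(RecordInput), as used in the examples
> // above, are omitted from this sketch.
> class RecordTypeInfo {
>   final String recordName;
>   final List<FieldTypeInfo> fields = new ArrayList<FieldTypeInfo>();
> 
>   RecordTypeInfo(String recordName) { this.recordName = recordName; }
> 
>   void addField(String name, TypeID typeID) {
>     fields.add(new FieldTypeInfo(name, typeID, null));
>   }
> }
> {code}
> For the updated _MyLogRecord_ above, _getRecordTypeInfo()_ would then return 
> an object holding the tuples {msg, USTRING}, {timestamp, LONG} and 
> {severity, INT}.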
> When deserializing, a Record looks at the filter and compares it with its own 
> set of {field name, field type} tuples. If there is a field in the data that 
> it doesn't know about, it skips it (it knows how many bytes to skip, based 
> on the filter). If the deserialized data does not contain some field values, 
> the Record gives them default values. Additionally, we could allow users to 
> optionally specify default values in the DDL. The location of a field in a 
> structure does not matter, which lets us support reordering of fields. Note 
> that no changes to the DDL syntax are required, and only very minimal changes 
> to client code (clients just need to read/write type information, in addition 
> to record data). 
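> As a rough sketch of that matching logic (building on the sketched 
> _RecordTypeInfo_ above; the abstract helper methods here are hypothetical, 
> standing in for code the stub generator would emit):
> {code}
> import java.io.IOException;
> import java.util.HashSet;
> import java.util.Set;
> import org.apache.hadoop.record.RecordInput;
> 
> abstract class VersionedReaderSketch {
>   abstract RecordTypeInfo ownTypeInfo();        // type info for this version of the record
>   abstract boolean knowsField(FieldTypeInfo f); // do we have a field with this {name, type}?
>   abstract void readField(RecordInput in, FieldTypeInfo f) throws IOException;
>   abstract void skipField(RecordInput in, FieldTypeInfo f) throws IOException; // the type ID says how much to skip
>   abstract void applyDefault(FieldTypeInfo f);  // give a missing field its default value
> 
>   void deserialize(RecordInput in, RecordTypeInfo filter) throws IOException {
>     Set<String> present = new HashSet<String>();
>     // The filter describes what is actually in the data, in order.
>     for (FieldTypeInfo f : filter.fields) {
>       if (knowsField(f)) {
>         readField(in, f);   // a field we have: read it into place
>         present.add(f.fieldName);
>       } else {
>         skipField(in, f);   // a field we don't know: skip it
>       }
>     }
>     // Our fields that were absent from the data get default values.
>     for (FieldTypeInfo f : ownTypeInfo().fields) {
>       if (!present.contains(f.fieldName)) {
>         applyDefault(f);
>       }
>     }
>   }
> }
> {code}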
> This scheme gives us an additional powerful feature: we can build a generic 
> serializer/deserializer, so that users can read all kinds of data without 
> having access to the original DDL or the original stubs. As long as you know 
> where the type information of a record is serialized, you can read all kinds 
> of data. One can also build a simple UI that displays the structure of data 
> serialized in any generic file. This is very useful for handling data across 
> lots of versions. 
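> As a rough illustration of such a generic reader (using the sketched 
> _RecordTypeInfo_ above together with the existing _RecordInput_ interface; 
> only a few type IDs are handled here):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.record.RecordInput;
> 
> // Hypothetical generic dumper: walks deserialized type information and
> // prints every field, with no need for the original DDL or stubs.
> class GenericDumpSketch {
>   static void dump(RecordInput in, RecordTypeInfo typeInfo) throws IOException {
>     in.startRecord(typeInfo.recordName);
>     for (FieldTypeInfo f : typeInfo.fields) {
>       switch (f.typeID) {
>         case INT:     System.out.println(f.fieldName + " = " + in.readInt(f.fieldName));    break;
>         case LONG:    System.out.println(f.fieldName + " = " + in.readLong(f.fieldName));   break;
>         case USTRING: System.out.println(f.fieldName + " = " + in.readString(f.fieldName)); break;
>         case STRUCT:  dump(in, f.nested); break; // recurse into a nested record
>         default: throw new IOException("type not handled in this sketch: " + f.typeID);
>       }
>     }
>     in.endRecord(typeInfo.recordName);
>   }
> }
> {code}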

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
