[
https://issues.apache.org/jira/browse/HADOOP-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555528#action_12555528
]
Hadoop QA commented on HADOOP-1883:
-----------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12372354/1883_patch03
against trunk revision .
    @author +1. The patch does not contain any @author tags.
    javadoc +1. The javadoc tool did not generate any warning messages.
    javac +1. The applied patch does not generate any new compiler warnings.
    findbugs -1. The patch appears to introduce 9 new Findbugs warnings.
    core tests +1. The patch passed core unit tests.
    contrib tests -1. The patch failed contrib unit tests.
Test results:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1453/testReport/
Findbugs warnings:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1453/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1453/artifact/trunk/build/test/checkstyle-errors.html
Console output:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1453/console
This message is automatically generated.
> Adding versioning to Record I/O
> -------------------------------
>
> Key: HADOOP-1883
> URL: https://issues.apache.org/jira/browse/HADOOP-1883
> Project: Hadoop
> Issue Type: New Feature
> Reporter: Vivek Ratan
> Assignee: Vivek Ratan
> Attachments: 1883_patch01, 1883_patch02, 1883_patch03, example.zip,
> example.zip
>
>
> There is a need to add versioning support to Record I/O. Users frequently
> update DDL files, usually by adding/removing fields, but do not want to
> change the name of the data structure. They would like older deserializers to
> read newer data, and newer deserializers to read older data, recovering as
> much of it as possible. For example, suppose Record
> I/O is used to serialize/deserialize log records, each of which contains a
> message and a timestamp. An initial data definition could be as follows:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
> }
> {code}
> Record I/O creates a class, _MyLogRecord_, which represents a log record and
> can serialize/deserialize itself. Now, suppose newer log records additionally
> contain a severity level. A user would want to update the definition for a
> log record but use the same class name. The new definition would be:
> {code}
> class MyLogRecord {
>   ustring msg;
>   long timestamp;
>   int severity;
> }
> {code}
> Users would want a new deserializer to read old log records (and perhaps use
> a default value for the severity field), and an old deserializer to read
> newer log records (and skip the severity field).
> This requires some concept of versioning in Record I/O, or rather, the
> additional ability to read/write type information of a record. The following
> is a proposal to do this.
> Every Record I/O Record will have type information which represents how the
> record is structured (what fields it has, what types, etc.). This type
> information, represented by the class _RecordTypeInfo_, is itself
> serializable/deserializable. Every Record supports a method
> _getRecordTypeInfo()_, which returns a _RecordTypeInfo_ object. Users are
> expected to serialize this type information (by calling
> _RecordTypeInfo.serialize()_) in an appropriate fashion (in a separate file,
> for example, or at the beginning of a file). Using the same DDL as above,
> here's how we could serialize log records:
> {code}
> FileOutputStream fOut = new FileOutputStream("data.log");
> CsvRecordOutput csvOut = new CsvRecordOutput(fOut);
> ...
> // get the type information for MyLogRecord
> RecordTypeInfo typeInfo = MyLogRecord.getRecordTypeInfo();
> // ask it to write itself out
> typeInfo.serialize(csvOut);
> ...
> // now, serialize a bunch of records
> while (...) {
>   MyLogRecord log = new MyLogRecord();
>   // fill up the MyLogRecord object
>   ...
>   // serialize
>   log.serialize(csvOut);
> }
> {code}
> In this example, the type information of the Record is serialized first,
> followed by the contents of the individual records, all into the same file.
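> In other words, the stream layout is, schematically:
> {code}
> data.log:  <RecordTypeInfo for MyLogRecord> <record 1> <record 2> ... <record N>
> {code}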
> Every Record also supports a method, _setRTIFilter()_, that lets a user set a
> filter for deserialization; it takes a _RecordTypeInfo_ object as a
> parameter. This filter represents the type information of the data that is
> being deserialized. When deserializing, the Record uses this filter (if one
> is set) to figure out what to read. Continuing with our example, here's how
> we could deserialize records:
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> // we know the record was written in CSV format
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> ...
> // we know the type info is written at the beginning of the file; read it
> RecordTypeInfo typeInfoFilter = new RecordTypeInfo();
> // deserialize it
> typeInfoFilter.deserialize(csvIn);
> // let MyLogRecord know what to expect
> MyLogRecord.setRTIFilter(typeInfoFilter);
> // deserialize each record
> while (...) { // as long as there is data in the file
>   MyLogRecord log = new MyLogRecord();
>   log.read(csvIn);
>   ...
> }
> {code}
> The filter is optional. If not provided, the deserializer expects data to be
> in the same format as it would serialize. (Note that a filter can also be
> provided for serializing, forcing the serializer to write information in the
> format of the filter, but there is no use case for this functionality yet).
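> For completeness, here is the no-filter path (a sketch using the same proposed
> API as the examples above):
> {code}
> FileInputStream fIn = new FileInputStream("data.log");
> CsvRecordInput csvIn = new CsvRecordInput(fIn);
> // no setRTIFilter() call: the record assumes the data is laid out exactly
> // as its own (current) DDL would serialize it
> MyLogRecord log = new MyLogRecord();
> log.read(csvIn);
> {code}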
> What goes in the type information for a record? The type information for each
> field in a Record is made up of:
> 1. a unique field ID, which is the field name.
> 2. a type ID, which denotes the type of the field (int, string, map, etc.).
> The type information for a composite type contains type information for each
> of its fields. This approach is somewhat similar to the one taken by
> [Facebook's Thrift|http://developers.facebook.com/thrift/], as well as by
> Google's Sawzall. The main difference is that we use the field name as the
> field ID, whereas Thrift and Sawzall use user-defined field numbers. While
> field names take more space, they have the big advantage that no changes are
> required to existing DDLs.
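> To make the structure concrete, here is a minimal sketch of what such a type
> description could look like. Everything below (the class shape, the _TypeID_
> enum, the method names) is illustrative, not the actual API of the attached
> patch:
> {code}
> import java.io.DataOutput;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> // Illustrative sketch only: type information keyed by field name,
> // itself serializable like any other record.
> public class RecordTypeInfo {
>   // assumed enum of DDL types; composite types would carry nested type info
>   public enum TypeID { BYTE, BOOL, INT, LONG, FLOAT, DOUBLE, USTRING, BUFFER, STRUCT, VECTOR, MAP }
>
>   private final String name;                          // record (class) name
>   private final List<String> fieldNames = new ArrayList<String>();
>   private final List<TypeID> fieldTypes = new ArrayList<TypeID>();
>
>   public RecordTypeInfo(String name) { this.name = name; }
>
>   // each field is identified by its name (the field ID) plus a type ID
>   public void addField(String fieldName, TypeID tid) {
>     fieldNames.add(fieldName);
>     fieldTypes.add(tid);
>   }
>
>   // serialized as: record name, field count, then (name, type) pairs
>   public void serialize(DataOutput out) throws IOException {
>     out.writeUTF(name);
>     out.writeInt(fieldNames.size());
>     for (int i = 0; i < fieldNames.size(); i++) {
>       out.writeUTF(fieldNames.get(i));
>       out.writeUTF(fieldTypes.get(i).name());
>     }
>   }
> }
> {code}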
> When deserializing, a Record looks at the filter and compares it with its own
> set of {field name, field type} tuples. If there is a field in the data that
> it doesn't know about, it skips it (it knows how many bytes to skip, based
> on the filter). If the deserialized data does not contain some field values,
> the Record gives them default values. Additionally, we could allow users to
> optionally specify default values in the DDL. The location of a field in a
> structure does not matter. This lets us support reordering of fields. Note
> that there is no change required to the DDL syntax, and very minimal changes
> to client code (clients just need to read/write type information, in addition
> to record data).
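> As a sketch of this matching logic (the helper names below are hypothetical;
> the real generated code would differ), the deserializer walks the filter's
> fields in stream order:
> {code}
> // Sketch of filter-driven deserialization; the stream is laid out in the
> // writer's field order, which is what the filter describes.
> void deserialize(RecordInput rin, RecordTypeInfo filter) throws IOException {
>   Set<String> seen = new HashSet<String>();
>   for (FieldTypeInfo f : filter.getFields()) {   // hypothetical accessor
>     if (knowsField(f.getName(), f.getTypeID())) {
>       readFieldInto(this, f.getName(), rin);     // field is in our DDL: read it
>       seen.add(f.getName());
>     } else {
>       skipValue(f.getTypeID(), rin);             // unknown field: the type ID says how much to skip
>     }
>   }
>   // fields in our DDL that the writer didn't know about get default values
>   for (FieldTypeInfo f : getRecordTypeInfo().getFields()) {
>     if (!seen.contains(f.getName())) {
>       setDefault(this, f.getName());
>     }
>   }
> }
> {code}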
> This scheme gives us an additional, powerful feature: we can build a generic
> serializer/deserializer, so that users can read all kinds of data without
> having access to the original DDL or the original stubs. As long as you know
> where the type information of a record is serialized, you can read all kinds
> of data. One can also build a simple UI that displays the structure of data
> serialized in any generic file. This is very useful for handling data across
> lots of versions.
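> Such a generic reader could look roughly like this (again a sketch; the
> _TypeID_-driven dispatch and the _getFields()_ accessor are assumed, not part
> of the current patch):
> {code}
> // Sketch of a stub-free reader: given only the deserialized RecordTypeInfo,
> // walk the fields and read each value according to its type ID.
> Map<String, Object> readGenericRecord(RecordInput rin, RecordTypeInfo typeInfo)
>     throws IOException {
>   Map<String, Object> values = new LinkedHashMap<String, Object>();
>   for (FieldTypeInfo f : typeInfo.getFields()) {   // hypothetical accessor
>     values.put(f.getName(), readValue(f.getTypeID(), f.getName(), rin));
>   }
>   return values;
> }
>
> // hypothetical dispatch on the type ID
> Object readValue(TypeID tid, String tag, RecordInput rin) throws IOException {
>   switch (tid) {
>     case INT:     return rin.readInt(tag);
>     case LONG:    return rin.readLong(tag);
>     case USTRING: return rin.readString(tag);
>     // ... remaining primitive and composite types handled similarly
>     default: throw new IOException("unsupported type: " + tid);
>   }
> }
> {code}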