[jira] [Commented] (HBASE-8693) Implement extensible type API based on serialization primitives

Nick Dimiduk (JIRA) Wed, 17 Jul 2013 16:07:02 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711775#comment-13711775
 ]


Nick Dimiduk commented on HBASE-8693:
-------------------------------------

bq. Ok, make sense with this limited scope (no schema) have a fixed list of 
fields.

Right. In this implementation Struct is a simple concatenation of fields. No 
schema information is written into that concatenation because doing so would 
interfere with sort order. Struct is merely an API convenience. Now, the field 
encodings implemented in OrderedBytes include a header byte which is currently 
used to identify the type of the encoded field that follows. The full space of 
256 available bit patterns in that header byte is not consumed by the current 
implementation. I've been thinking about extending that header byte to include 
some version bits at the very beginning. That would enable evolution of the 
individual field encodings (say, if you later want to re-implement blob-mid). 
This doesn't address the user-level logical structure of a Struct data type, 
only evolution of the OrderedBytes codec.
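
To make the version-bits idea concrete, here is a minimal sketch. The split 
into 2 version bits and 6 type bits, and the names VersionedHeader, pack, 
version and typeCode, are assumptions for illustration only; they are not part 
of OrderedBytes or of the attached patch.

```java
// Hypothetical header layout: [vv tttttt]
// vv     = 2 high bits carrying a codec version (assumed width)
// tttttt = 6 low bits carrying the field type code (assumed width)
final class VersionedHeader {
    static final int VERSION_BITS = 2;
    static final int TYPE_BITS = 8 - VERSION_BITS;
    static final int TYPE_MASK = (1 << TYPE_BITS) - 1;

    /** Packs a version and a type code into a single header byte. */
    static byte pack(int version, int typeCode) {
        if (version < 0 || version >= (1 << VERSION_BITS)) {
            throw new IllegalArgumentException("version out of range: " + version);
        }
        if (typeCode < 0 || typeCode > TYPE_MASK) {
            throw new IllegalArgumentException("type code out of range: " + typeCode);
        }
        return (byte) ((version << TYPE_BITS) | typeCode);
    }

    /** Extracts the version from the high bits of the header byte. */
    static int version(byte header) {
        return (header & 0xFF) >>> TYPE_BITS;
    }

    /** Extracts the type code from the low bits of the header byte. */
    static int typeCode(byte header) {
        return header & TYPE_MASK;
    }
}
```

Note that with the version in the high bits, two encoded fields with different 
versions compare by version before they compare by type, so a real design would 
have to account for that in the sort order.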

bq. My main concern is: I start use 96 with this struct encoding... is fixed so 
I can't add fields.. so I work around it adding a version number in front of 
the struct and then I do the switch for v1, v2, v3 with all the fixed struct 
that I know...

Prepending a version number to the Struct's members will impact sort order. 
The Struct definition is fixed in that you can't prepend a new field to an 
existing encoded value or interpose one in the middle of it. You're free to 
append fields. Appending a field would look like the following:

 # application defines Struct v0 with members [A,B,C]
 # application writes lots of data
 # application changes, Struct v1 becomes [A,B,C,D,E]
 # application writes lots more data

At step 3, the application now needs to become version-aware. Because the 
fields of v0 are a prefix of the fields of v1, the application can use the 
definition of Struct v1 with the following safeguards. (1) Any code written 
against v0 now needs to check for end-of-buffer and skip over the two new 
elements when they are present. (2) Any code written against v1 needs to be 
mindful of truncated records and be prepared to receive only the v0 fields. 
Maybe the API defined around Struct can be improved to support these needs?
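
A rough sketch of that kind of version-aware reader follows. It does not use 
the Struct or OrderedBytes APIs; it assumes a toy encoding in which every field 
is a fixed-width 4-byte big-endian int, and the names AppendOnlyStruct, encode 
and decode are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a Struct: fields are concatenated with no schema or
// version information, each as a 4-byte big-endian int (assumed encoding).
final class AppendOnlyStruct {
    static final int FIELD_WIDTH = 4;

    /** Concatenates the given fields, in order, into one byte[]. */
    static byte[] encode(int... fields) {
        ByteBuffer buf = ByteBuffer.allocate(FIELD_WIDTH * fields.length);
        for (int f : fields) {
            buf.putInt(f);
        }
        return buf.array();
    }

    /**
     * Reads at most maxFields fields and stops early at end-of-buffer.
     * A v0 reader passes 3 and so ignores trailing v1 fields; a v1 reader
     * passes 5 and must handle a result that holds only the 3 v0 fields.
     */
    static List<Integer> decode(byte[] row, int maxFields) {
        ByteBuffer buf = ByteBuffer.wrap(row);
        List<Integer> out = new ArrayList<>();
        while (out.size() < maxFields && buf.remaining() >= FIELD_WIDTH) {
            out.add(buf.getInt());
        }
        return out;
    }
}
```

Here decode(encode(1, 2, 3), 5) yields only the three v0 fields, and 
decode(encode(1, 2, 3, 4, 5), 3) ignores the two appended ones.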

Records of v0 and v1 can be intermixed, for example as rowkeys in the same 
table. According to the documented sort semantics, they'll sort "left-to-right 
and depth-first". That is, they'll sort first according to the v0 fields and 
then, within each group of equal v0 fields, by the v1 fields.
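
The effect on intermixed rowkeys can be illustrated with a small comparator. 
This is only a sketch: it assumes one-byte fields rather than real OrderedBytes 
encodings, and RowKeyOrder is a hypothetical name. The comparison itself is 
plain unsigned lexicographic byte order, which is how HBase orders rowkeys.

```java
// Unsigned, left-to-right comparison of two encoded keys.
final class RowKeyOrder {
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c; // first differing byte decides
            }
        }
        // One key is a prefix of the other: the shorter key sorts first.
        return Integer.compare(a.length, b.length);
    }
}
```

With fields [A,B,C] = {1,2,3}, the v0 key {1,2,3} sorts before the v1 key 
{1,2,3,4,5}, and both sort before the v0 key {1,2,4}.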

We leave all of this up to user applications today, so this change-management 
burden isn't mitigated. Changing a compound rowkey today requires rewriting 
data (or duplication into a new table). A smarter struct encoding, one that's 
able to preserve the sorted semantics I've described but that can also track 
more sophisticated schema changes, would be very useful indeed -- I don't think 
it exists.

Prepending a version field to a Struct will change the sorting behavior; v0 
will sort before v1, &c. IMHO, this is a less flexible migration strategy than 
the append behavior described above. It's also perfectly valid, and the user of 
the Struct API is free to do so in their own application. In that case, the 
application is still version-aware. Instead of being cautious about consuming 
potentially truncated records, it executes a scan for each version.
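
For comparison, a sketch of the version-prefix approach. Again this assumes 
one-byte fields and a one-byte version prefix; VersionPrefixedKeys, withVersion 
and scanVersion are hypothetical names, and scanVersion only simulates a scan 
over an in-memory list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class VersionPrefixedKeys {
    /** Unsigned lexicographic order, shorter prefix first. */
    static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    };

    /** Prepends a one-byte version to the encoded fields. */
    static byte[] withVersion(int version, byte[] fields) {
        byte[] out = new byte[fields.length + 1];
        out[0] = (byte) version;
        System.arraycopy(fields, 0, out, 1, fields.length);
        return out;
    }

    /** Simulates one scan per version: sorted keys carrying that prefix. */
    static List<byte[]> scanVersion(List<byte[]> table, int version) {
        List<byte[]> hits = new ArrayList<>();
        for (byte[] key : table) {
            if (key.length > 0 && (key[0] & 0xFF) == version) {
                hits.add(key);
            }
        }
        hits.sort(UNSIGNED_LEX);
        return hits;
    }
}
```

Without the prefix, {9,9,9} sorts after {1,1,1,1,1}; with it, every v0 key 
sorts before every v1 key regardless of field values, which is why the 
application ends up scanning each version separately.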

bq. as you said, data evolution is out of the scope. so if you consider this 
patch just as a "smarter" alternative to the Bytes encoding.

HBASE-8201 is a smarter alternative to Bytes and this ticket adds some 
higher-level APIs for manipulating those encodings. In short, yes, schema 
definition and evolution are out of scope.
                
> Implement extensible type API based on serialization primitives
> ---------------------------------------------------------------
>
>                 Key: HBASE-8693
>                 URL: https://issues.apache.org/jira/browse/HBASE-8693
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Client
>            Reporter: Nick Dimiduk
>            Assignee: Nick Dimiduk
>             Fix For: 0.95.2
>
>         Attachments: 0001-HBASE-8693-Extensible-data-types-API.patch, 
> 0001-HBASE-8693-Extensible-data-types-API.patch, 
> 0001-HBASE-8693-Extensible-data-types-API.patch, 
> 0002-HBASE-8693-example-Use-DataType-API-to-build-regionN.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
