[
https://issues.apache.org/jira/browse/HBASE-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711775#comment-13711775
]
Nick Dimiduk commented on HBASE-8693:
-------------------------------------
bq. Ok, make sense with this limited scope (no schema) have a fixed list of
fields.
Right. In this implementation Struct is a simple concatenation of fields. No
schema information is written into that concatenation because doing so would
interfere with sort order. Struct is merely an API convenience. Now, the field
encodings implemented in OrderedBytes include a header byte which is currently
used to identify the type of encoded field that follows. The full space of 256
available bit patterns in that header byte is not consumed by the current
implementation. I've been thinking about extending that header byte to include
some version bits at the very beginning. That would enable evolution of the
individual field encodings (say, if you later want to re-implement blob-mid).
This doesn't address the user-level logical structure of a Struct
data type, only evolution of the OrderedBytes codec.
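To make the header-byte idea concrete, here is a minimal sketch of one way
version bits could share a byte with a type code. The 2-bit/6-bit split, the
names pack_header and unpack_header, and the sample type code are all
assumptions for illustration; none of this is the OrderedBytes layout.

```python
# Hypothetical layout: the top 2 bits carry a codec version, the low 6 bits
# carry the type code. This split is an assumption, not the OrderedBytes format.
VERSION_BITS = 2
TYPE_BITS = 8 - VERSION_BITS
TYPE_MASK = (1 << TYPE_BITS) - 1


def pack_header(version: int, type_code: int) -> int:
    """Combine a version and a type code into a single header byte."""
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version does not fit in the version bits")
    if not 0 <= type_code <= TYPE_MASK:
        raise ValueError("type code does not fit in the type bits")
    return (version << TYPE_BITS) | type_code


def unpack_header(header: int) -> tuple:
    """Split a header byte back into (version, type_code)."""
    return header >> TYPE_BITS, header & TYPE_MASK


SAMPLE_TYPE = 0x15  # made-up type code, for illustration only
v0_header = pack_header(0, SAMPLE_TYPE)
v1_header = pack_header(1, SAMPLE_TYPE)
print(unpack_header(v1_header))
```

Because the version occupies the most significant bits in this sketch, a field
written with a newer codec version compares greater than the same type written
with an older one, so mixing codec versions within one field position would
itself affect sort order.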
bq. My main concern is: I start use 96 with this struct encoding... is fixed so
I can't add fields.. so I work around it adding a version number in front of
the struct and then I do the switch for v1, v2, v3 with all the fixed struct
that I know...
Prepending a version number to the Struct's members will impact sort order.
Struct definition is fixed in that you can't prepend a new field or insert one
in the middle of an existing encoded value. You're free to append fields.
Appending a field would look like the following:
# application defines Struct v0 with members [A,B,C]
# application writes lots of data
# application changes, Struct v1 becomes [A,B,C,D,E]
# application writes lots more data
At step 3, the application needs to become version-aware. Because the
fields of v0 are a prefix of the fields of v1, the application can use the
definition of Struct v1 with the following safeguards. (1) Any code written
against v0 now needs to check for end-of-buffer and skip over the two new
elements when they are present. (2) Any code written against v1 needs to be
mindful of truncated records and be prepared to receive only the v0 fields.
Maybe the API defined around Struct can be improved to support these needs?
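The two safeguards above can be sketched with a toy encoding. This is not the
Struct or OrderedBytes API: each field is assumed to be a fixed-width unsigned
32-bit big-endian integer, and encode_struct and decode_struct are
hypothetical helpers.

```python
import struct
from typing import List, Optional

FIELD_WIDTH = 4  # toy field: unsigned 32-bit, big-endian (an assumption)


def encode_struct(values: List[int]) -> bytes:
    """Concatenate the encoded fields; no schema information is written."""
    return b"".join(struct.pack(">I", v) for v in values)


def decode_struct(buf: bytes, num_fields: int) -> List[Optional[int]]:
    """Decode num_fields fields, yielding None for fields past end-of-buffer.

    Trailing bytes beyond num_fields are ignored, which is how a v0 reader
    skips over fields appended by a later version.
    """
    out: List[Optional[int]] = []
    pos = 0
    for _ in range(num_fields):
        if pos + FIELD_WIDTH > len(buf):
            out.append(None)  # truncated record: the field was never written
            continue
        out.append(struct.unpack_from(">I", buf, pos)[0])
        pos += FIELD_WIDTH
    return out


V0_FIELDS = 3  # [A, B, C]
V1_FIELDS = 5  # [A, B, C, D, E]

v0_record = encode_struct([1, 2, 3])
v1_record = encode_struct([1, 2, 3, 4, 5])

print(decode_struct(v0_record, V1_FIELDS))  # [1, 2, 3, None, None]
print(decode_struct(v1_record, V0_FIELDS))  # [1, 2, 3]
```

With variable-width field encodings the end-of-buffer check would have to
happen before each field is decoded rather than by comparing fixed offsets.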
Records of v0 and v1 can be intermixed, i.e., as rowkeys in the same table.
According to the documented sort semantics, they'll sort "left-to-right and
depth-first". That is, they'll sort first according to the v0 fields and then,
within each group sharing those values, by the fields added in v1.
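A small sketch of that ordering, again using a toy encoding (fixed-width
unsigned 32-bit big-endian fields, an assumption) and plain lexicographic
byte comparison to stand in for rowkey order:

```python
import struct


def encode_struct(values):
    """Toy encoding: concatenate unsigned 32-bit big-endian fields."""
    return b"".join(struct.pack(">I", v) for v in values)


def decode_all(buf):
    """Decode every 4-byte field present in the buffer."""
    return list(struct.unpack(">%dI" % (len(buf) // 4), buf))


rows = [
    encode_struct([1, 2, 3, 9, 9]),  # v1 record
    encode_struct([1, 2, 4]),        # v0 record
    encode_struct([1, 2, 3]),        # v0 record
    encode_struct([1, 2, 3, 0, 7]),  # v1 record
    encode_struct([1, 2, 4, 0, 0]),  # v1 record
]

# Lexicographic byte order; a record that is a prefix of another sorts first.
ordered = [decode_all(r) for r in sorted(rows)]
for fields in ordered:
    print(fields)
# [1, 2, 3]
# [1, 2, 3, 0, 7]
# [1, 2, 3, 9, 9]
# [1, 2, 4]
# [1, 2, 4, 0, 0]
```

Each v0 record lands immediately before the v1 records that share its
[A,B,C] values, so a prefix scan on [A,B,C] returns both versions together.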
We leave all of this up to user applications today, so this ticket doesn't
relieve them of that change management. Changing a compound rowkey today
requires rewriting data (or duplication into a new table). A smarter struct
encoding, one that's able to preserve the sort semantics I've described but
that can also track more sophisticated schema change, would be very useful
indeed -- I don't think it exists.
Prepending a version field to a Struct will change the sorting behavior; v0
will sort before v1, etc. IMHO, this is a less flexible migration strategy than
the append behavior described above. It's also perfectly valid, and the user of
the Struct API is free to do so in their own application. In that case, the
application is still version-aware. Instead of being cautious about consuming
potentially truncated records, it executes a scan for each version.
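For comparison, a sketch of the prepended-version approach. The one-byte
version prefix, the toy field encoding, and the encode_versioned and
scan_version helpers are assumptions; a sorted Python list with bisect stands
in for a table scanned with start and stop rows.

```python
import bisect
import struct


def encode_versioned(version, values):
    """Toy encoding: one version byte, then unsigned 32-bit big-endian fields."""
    return bytes([version]) + b"".join(struct.pack(">I", v) for v in values)


# A sorted list of rowkeys stands in for the table.
table = sorted([
    encode_versioned(1, [1, 2, 3, 4, 5]),
    encode_versioned(0, [1, 2, 4]),
    encode_versioned(1, [0, 0, 0, 0, 0]),
    encode_versioned(0, [1, 2, 3]),
])


def scan_version(rows, version):
    """Return the contiguous range of rows written with the given version.

    Assumes version < 255 so that version + 1 still fits in one byte.
    """
    start = bytes([version])
    stop = bytes([version + 1])
    lo = bisect.bisect_left(rows, start)
    hi = bisect.bisect_left(rows, stop)
    return rows[lo:hi]


v0_rows = scan_version(table, 0)
v1_rows = scan_version(table, 1)
print(len(v0_rows), len(v1_rows))  # 2 2
```

Note that the v1 record [0,0,0,0,0] sorts after every v0 record even though
its field values are smaller: the version byte dominates the ordering, which
is why reading all records for a given [A,B,C] takes one scan per version.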
bq. as you said, data evolution is out of the scope. so if you consider this
patch just as a "smarter" alternative to the Bytes encoding.
HBASE-8201 is a smarter alternative to Bytes, and this ticket adds some
higher-level APIs for manipulating those encodings. In short, yes, schema
definition and evolution are out of scope.
> Implement extensible type API based on serialization primitives
> ---------------------------------------------------------------
>
> Key: HBASE-8693
> URL: https://issues.apache.org/jira/browse/HBASE-8693
> Project: HBase
> Issue Type: Sub-task
> Components: Client
> Reporter: Nick Dimiduk
> Assignee: Nick Dimiduk
> Fix For: 0.95.2
>
> Attachments: 0001-HBASE-8693-Extensible-data-types-API.patch,
> 0001-HBASE-8693-Extensible-data-types-API.patch,
> 0001-HBASE-8693-Extensible-data-types-API.patch,
> 0002-HBASE-8693-example-Use-DataType-API-to-build-regionN.patch
>
>