Viraj Jasani created PHOENIX-7357:
-------------------------------------

             Summary: New variable length binary data type: VARBINARY_ENCODED
                 Key: PHOENIX-7357
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7357
             Project: Phoenix
          Issue Type: New Feature
            Reporter: Viraj Jasani
            Assignee: Viraj Jasani
             Fix For: 5.3.0

As of today, Phoenix provides several variable length as well as fixed length data types. One of the variable length data types is VARBINARY. It is a variable length binary blob. Using VARBINARY as the only primary key column is equivalent to using the HBase row key directly.

HBase provides a single row key. A client application that needs more than one primary key column and uses HBase directly must handle storing the column values as a single binary row key itself. Phoenix removes this burden by providing composite primary keys. A composite primary key can contain any number of primary key columns. Phoenix also provides the ability to add new nullable primary key columns to an existing composite primary key.

Phoenix uses HBase as its backing store. To let users define multiple primary key columns, Phoenix internally concatenates the binary encoded value of each primary key column and uses the concatenated binary value as the HBase row key. To concatenate and later retrieve the individual primary key values efficiently, Phoenix uses two approaches:

# For fixed length columns: The length of the given column is determined by the maximum length of the column. As part of the read flow, while iterating through the row key, a fixed number of bytes is retrieved for the column. While writing, if the original encoded value of the given column has fewer bytes, additional null bytes (\x00) are padded until the fixed length is filled up. Hence, for smaller values, we end up wasting some space.
# For variable length columns: Since we cannot know the length of a variable length value in advance, a separator or terminator byte is used. Phoenix uses the null byte (\x00) as the separator byte. As of today, VARCHAR is the most commonly used variable length data type, and since VARCHAR represents a String, the null byte is not a valid character in its value.
Hence, the null byte can be used reliably to determine where a given VARCHAR value terminates.

The null byte (\x00) works fine as a separator for VARCHAR. However, it cannot be used as a separator byte for VARBINARY, because a VARBINARY value can be any binary blob and may itself contain null bytes. Due to this, Phoenix has the following restrictions for the VARBINARY type:

# It can only be used as the last part of the composite primary key.
# It cannot be used as a DESC order primary key column.

Using the VARBINARY data type as an earlier portion of the composite primary key is a valid use case. One can also want multiple VARBINARY primary key columns. After all, Phoenix provides the ability to use multiple primary key columns.

Besides, using a secondary index on a data table means that the composite primary key of the secondary index table includes:

<secondary-index-col1> <secondary-index-col2> … <secondary-index-colN> <primary-key-col1> <primary-key-col2> … <primary-key-colN>

As the primary key columns are appended to the secondary index columns, an indexed VARBINARY column would no longer be the last part of the index table's composite primary key. Hence, one cannot create a secondary index on any VARBINARY column.

The proposal of this Jira is to introduce a new data type, {*}VARBINARY_ENCODED{*}, which has no restriction on being used as a composite primary key prefix or as a DESC ordered column. This means we need to effectively distinguish where the variable length binary data terminates in the absence of fixed length information.
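
To make the separator problem described above concrete, here is a minimal Python sketch. It is not Phoenix's actual encoder: the helper names (pad_fixed, concat_row_key, split_row_key) are hypothetical, and the sketch only models null byte padding for fixed length values and null byte separation for variable length values. It does not attempt to show how VARBINARY_ENCODED would encode its values, since that is not specified here.

```python
SEPARATOR = b"\x00"  # null byte used as separator and as padding byte


def pad_fixed(value, length):
    """Pad a fixed length column value with null bytes up to `length`."""
    if len(value) > length:
        raise ValueError("value is longer than the declared column length")
    return value + SEPARATOR * (length - len(value))


def concat_row_key(parts):
    """Join variable length column values into one row key (simplified)."""
    return SEPARATOR.join(parts)


def split_row_key(row_key):
    """Recover column values by splitting on the separator byte."""
    return row_key.split(SEPARATOR)


# Fixed length: a 2-byte value in a 4-byte column wastes 2 bytes on padding.
padded = pad_fixed(b"ab", 4)

# VARCHAR-like values contain no null byte, so the round trip is lossless.
varchar_parts = [b"tenant1", b"user42"]
varchar_round_trip = split_row_key(concat_row_key(varchar_parts))

# Binary values may contain the null byte, so the split points are ambiguous.
binary_parts = [b"\x01\x00\x02", b"\x03"]
binary_round_trip = split_row_key(concat_row_key(binary_parts))

print(padded)
print(varchar_round_trip == varchar_parts)
print(binary_round_trip == binary_parts)
```

In this sketch the two VARCHAR-like values are recovered unchanged, while the two binary values come back as three, because the null byte inside the first value is indistinguishable from a separator. That ambiguity is what limits VARBINARY to the last position of the composite primary key.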