Viraj Jasani created PHOENIX-7357:
-------------------------------------
Summary: New variable length binary data type: VARBINARY_ENCODED
Key: PHOENIX-7357
URL: https://issues.apache.org/jira/browse/PHOENIX-7357
Project: Phoenix
Issue Type: New Feature
Reporter: Viraj Jasani
Assignee: Viraj Jasani
Fix For: 5.3.0
As of today, Phoenix provides several variable length as well as fixed length
data types. One of the variable length data types is VARBINARY, a variable
length binary blob. Using a VARBINARY column as the only primary key column is
equivalent to using the HBase row key directly.
HBase provides a single row key. A client application that needs more than one
primary key column and uses HBase directly must handle this itself, by storing
all the column values in a single binary row key. Phoenix provides the ability
to use more than one primary key column by providing composite primary keys.
A composite primary key can contain any number of primary key columns. Phoenix
also provides the ability to add new nullable primary key columns to an
existing composite primary key. Phoenix uses HBase as its backing store. In
order to let users define multiple primary key columns, Phoenix internally
concatenates the binary encoded value of each primary key column and uses the
concatenated binary value as the HBase row key. In order to efficiently
concatenate as well as retrieve individual primary key values, Phoenix uses
two approaches:
# For fixed length columns: The length of the given column is determined by
the maximum length of the column. As part of the read flow, while iterating
through the row key, that fixed number of bytes is read for the column.
While writing, if the original encoded value of the given column has fewer
bytes, null bytes (\x00) are added as padding until the fixed length is
filled up. Hence, for smaller values, we end up wasting some space.
# For variable length columns: Since we cannot know the length of the value of
a variable length data type in advance, a separator or terminator byte is used.
Phoenix uses the null byte (\x00) as the separator. As of today, VARCHAR is the
most commonly used variable length data type and, since VARCHAR represents a
String, the null byte is not a valid String character. Hence, it can be
effectively used to determine where the given VARCHAR value terminates.
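The two approaches above can be sketched roughly as follows. This is a
simplified illustration, not Phoenix's actual implementation: the function
names are hypothetical, the values are assumed to be already encoded as bytes,
and details such as DESC ordering and trailing separators are ignored.

```python
SEPARATOR = b"\x00"  # null byte used as the variable length separator


def build_row_key(columns):
    """columns: list of (encoded_value, fixed_length), fixed_length is None
    for variable length columns."""
    parts = []
    for i, (value, fixed_length) in enumerate(columns):
        if fixed_length is not None:
            if len(value) > fixed_length:
                raise ValueError("value longer than the fixed column length")
            # Pad shorter values with null bytes up to the fixed length.
            parts.append(value.ljust(fixed_length, b"\x00"))
        else:
            parts.append(value)
            # Simplification: no separator after the last column.
            if i < len(columns) - 1:
                parts.append(SEPARATOR)
    return b"".join(parts)


def split_row_key(row_key, schema):
    """schema: list of fixed lengths, None for variable length columns."""
    values = []
    pos = 0
    for fixed_length in schema:
        if fixed_length is not None:
            # Fixed length: read exactly that many bytes (padding included).
            values.append(row_key[pos:pos + fixed_length])
            pos += fixed_length
        else:
            # Variable length: read up to the next separator or end of key.
            end = row_key.find(SEPARATOR, pos)
            if end == -1:
                end = len(row_key)
            values.append(row_key[pos:end])
            pos = end + 1
    return values


row_key = build_row_key([(b"ab", 4), (b"hello", None), (b"xyz", None)])
columns = split_row_key(row_key, [4, None, None])
```

Note that the fixed length column comes back with its padding bytes, which is
the wasted space mentioned above.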
The null byte (\x00) works fine as a separator for VARCHAR. However, it cannot
be used as a separator byte for VARBINARY because a VARBINARY value can contain
arbitrary bytes, including \x00 itself. Due to this, Phoenix has restrictions
for the VARBINARY type:
# It can only be used as the last part of the composite primary key.
# It cannot be used as a DESC order primary key column.
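A small sketch of why a plain \x00 separator is ambiguous for binary values.
The helper names are hypothetical and the example only illustrates the
position restriction, not the DESC one: two different pairs of binary values
produce the same concatenated key, so a reader cannot tell them apart.

```python
SEPARATOR = b"\x00"


def concat(first: bytes, second: bytes) -> bytes:
    # Join two binary values with the null byte separator.
    return first + SEPARATOR + second


def naive_split(key: bytes):
    # Split at the first null byte, as a VARCHAR style reader would.
    first, second = key.split(SEPARATOR, 1)
    return first, second


key_a = concat(b"\x01\x00\x02", b"\x03")
key_b = concat(b"\x01", b"\x02\x00\x03")

same_key = key_a == key_b          # both are 01 00 02 00 03
recovered = naive_split(key_a)     # yields the values of key_b, not key_a
```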
Using a VARBINARY column in an earlier position of the composite primary key is
a valid use case. One may also want multiple VARBINARY primary key columns.
After all, Phoenix lets users define multiple primary key columns.
Besides, creating a secondary index on a data table means that the composite
primary key of the secondary index table consists of:
<secondary-index-col1> <secondary-index-col2> … <secondary-index-colN>
<primary-key-col1> <primary-key-col2> … <primary-key-colN>
As the primary key columns are appended to the secondary index columns, an
indexed VARBINARY column can never be the last part of the index row key.
Hence, one cannot create a secondary index on any VARBINARY column.
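The column ordering described above can be illustrated with a short sketch.
The table layout, column names and helper functions here are hypothetical and
only model the ordering, not Phoenix's index implementation.

```python
def index_row_key_columns(indexed_columns, data_pk_columns):
    # Index row key: indexed columns first, then the data table PK columns.
    return list(indexed_columns) + list(data_pk_columns)


def varbinary_before_last(key_columns, column_types):
    # VARBINARY is only allowed in the last position of a composite key.
    return [c for c in key_columns[:-1] if column_types[c] == "VARBINARY"]


column_types = {
    "TENANT_ID": "VARCHAR",
    "EVENT_ID": "VARCHAR",
    "PAYLOAD": "VARBINARY",
}

# Index on PAYLOAD for a data table whose primary key is (TENANT_ID, EVENT_ID).
index_key = index_row_key_columns(["PAYLOAD"], ["TENANT_ID", "EVENT_ID"])
offending = varbinary_before_last(index_key, column_types)
```

Since at least one primary key column always follows the indexed column,
PAYLOAD ends up before the last position and the restriction is violated.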
The proposal of this Jira is to introduce a new data type,
{*}VARBINARY_ENCODED{*}, which has no restriction on being used as a
composite primary key prefix or as a DESC ordered column.
This means we need to effectively determine where the variable length binary
data terminates in the absence of fixed length information.
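One generic way to mark the end of arbitrary binary data is byte stuffing:
escape every \x00 inside the value and use a two byte terminator that cannot
appear in the escaped data. The sketch below illustrates that general
technique only. The marker bytes and function names are assumptions chosen for
this example, and it is not necessarily the encoding that VARBINARY_ENCODED
will use.

```python
ESCAPED_ZERO = b"\x00\xff"  # assumed escape sequence for a literal \x00
TERMINATOR = b"\x00\x01"    # assumed end-of-value marker


def encode(value: bytes) -> bytes:
    # Escape embedded null bytes, then append the terminator.
    return value.replace(b"\x00", ESCAPED_ZERO) + TERMINATOR


def decode_first(buf: bytes, pos: int = 0):
    """Decode one value starting at pos; return (value, next_position)."""
    out = bytearray()
    i = pos
    while i < len(buf):
        if buf[i] != 0:
            out.append(buf[i])
            i += 1
            continue
        if i + 1 >= len(buf):
            raise ValueError("dangling null byte")
        marker = buf[i + 1]
        if marker == 0xFF:      # escaped literal null byte
            out.append(0)
            i += 2
        elif marker == 0x01:    # end of this value
            return bytes(out), i + 2
        else:
            raise ValueError("invalid byte after null byte")
    raise ValueError("missing terminator")


row_key = encode(b"\x01\x00\x02") + encode(b"\x03")
first, next_pos = decode_first(row_key)
second, end_pos = decode_first(row_key, next_pos)
```

With this scheme the two value pairs that collide under a plain \x00 separator,
(01 00 02, 03) and (01, 02 00 03), produce different keys, and the encoded
values in this example still sort in the same byte order as the raw values.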
--
This message was sent by Atlassian Jira
(v8.20.10#820010)