[
https://issues.apache.org/jira/browse/CASSANDRA-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joey Lynch updated CASSANDRA-20102:
-----------------------------------
Description:
A common question from users is how to query for the length of {{blob}} and
{{text}} columns, and the answer is often for users to enable UDFs and then
write their own
([1|https://stackoverflow.com/questions/49674114/cassandra-query-string-length],
[2|https://stackoverflow.com/questions/58401750/how-can-i-get-size-of-blob-column-of-cassandra-using-user-defined-functions],
...). I think it would make sense for Cassandra to support a subset of the
SQL99 [binary length
functions|https://sql-99.readthedocs.io/en/latest/chapters/05.html#octet-length-expression]
and a UTF8 codepoint aware {{length}} function similar to PostgreSQL's
[{{length}}|https://www.postgresql.org/docs/9.1/functions-string.html] function.
Specifically I propose two new system functions:
# {{octetlength}} - Defined on all types and returns the number of underlying
bytes in the bytebuffer representation of the column. Does _not_ account for
storage overhead (e.g. does not account for metadata)
# {{length}} - Defined on UTF8 strings, and returns the number of UTF8
codepoints in the string, equivalent to Java's
[{{String.length}}|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/String.html#length()]
function.
Patch coming shortly. Particularly interested in feedback if we A) think these
functions are reasonable to have inside Cassandra, and B) like the names of the
functions to align with SQL or if we want to go nonstandard and do like
{{blength}} instead of octet length.
An example of the difference between octet length and string length
{code}
// A table with text key and blob value
cqlsh> select key, value from test.test;
key | value
-------+--------
hello | 0x12af
cqlsh> select length(key), octetlength(value) from test.test;
system.length(key) | system.octetlength(value)
--------------------+---------------------------
5 | 2
cqlsh> INSERT INTO test.test (key, value) VALUES ('こんにちは世界', 0x12AF);
cqlsh> select length(key), octetlength(key), octetlength(value) from test.test
where key = 'こんにちは世界';
system.length(key) | system.octetlength(key) | system.octetlength(value)
--------------------+-------------------------+---------------------------
7 | 21 | 2
{code}
was:
A common question from users is how to query for the length of {{blob}} and
{{text}} columns, and the answer is often for users to enable UDFs and then
write their own
([1|https://stackoverflow.com/questions/49674114/cassandra-query-string-length],
[2|https://stackoverflow.com/questions/58401750/how-can-i-get-size-of-blob-column-of-cassandra-using-user-defined-functions],
...). I think it would make sense for Cassandra to support a subset of the
SQL99 [binary length
functions|https://sql-99.readthedocs.io/en/latest/chapters/05.html#octet-length-expression]
and a UTF8 codepoint aware {{length}} function similar to PostgreSQL's
[{{length}}|https://www.postgresql.org/docs/9.1/functions-string.html] function.
Specifically I propose two new system functions:
# {{octetlength}} - Defined on all types and returns the number of underlying
bytes in the bytebuffer representation of the column.
# {{length}} - Defined on UTF8 strings, and returns the number of UTF8
codepoints in the string, equivalent to Java's
[{{String.length}}|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/String.html#length()]
function.
Patch coming shortly. Particularly interested in feedback if we A) think these
functions are reasonable to have inside Cassandra, and B) like the names of the
functions to align with SQL or if we want to go nonstandard and do like
{{blength}} instead of octet length.
An example of the difference between octet length and string length
{code}
// A table with text key and blob value
cqlsh> select key, value from test.test;
key | value
-------+--------
hello | 0x12af
cqlsh> select length(key), octetlength(value) from test.test;
system.length(key) | system.octetlength(value)
--------------------+---------------------------
5 | 2
cqlsh> INSERT INTO test.test (key, value) VALUES ('こんにちは世界', 0x12AF);
cqlsh> select length(key), octetlength(key), octetlength(value) from test.test
where key = 'こんにちは世界';
system.length(key) | system.octetlength(key) | system.octetlength(value)
--------------------+-------------------------+---------------------------
7 | 21 | 2
{code}
> Support simple length functions for blob and string
> ---------------------------------------------------
>
> Key: CASSANDRA-20102
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20102
> Project: Cassandra
> Issue Type: Improvement
> Components: CQL/Semantics
> Reporter: Joey Lynch
> Assignee: Joey Lynch
> Priority: Normal
>
> A common question from users is how to query for the length of {{blob}} and
> {{text}} columns, and the answer is often for users to enable UDFs and then
> write their own
> ([1|https://stackoverflow.com/questions/49674114/cassandra-query-string-length],
>
> [2|https://stackoverflow.com/questions/58401750/how-can-i-get-size-of-blob-column-of-cassandra-using-user-defined-functions],
> ...). I think it would make sense for Cassandra to support a subset of the
> SQL99 [binary length
> functions|https://sql-99.readthedocs.io/en/latest/chapters/05.html#octet-length-expression]
> and a UTF8 codepoint aware {{length}} function similar to PostgreSQL's
> [{{length}}|https://www.postgresql.org/docs/9.1/functions-string.html]
> function.
> Specifically I propose two new system functions:
> # {{octetlength}} - Defined on all types and returns the number of underlying
> bytes in the bytebuffer representation of the column. Does _not_ account for
> storage overhead (e.g. does not account for metadata)
> # {{length}} - Defined on UTF8 strings, and returns the number of UTF8
> codepoints in the string, equivalent to Java's
> [{{String.length}}|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/String.html#length()]
> function.
> Patch coming shortly. Particularly interested in feedback if we A) think
> these functions are reasonable to have inside Cassandra, and B) like the
> names of the functions to align with SQL or if we want to go nonstandard and
> do like {{blength}} instead of octet length.
> An example of the difference between octet length and string length
> {code}
> // A table with text key and blob value
> cqlsh> select key, value from test.test;
> key | value
> -------+--------
> hello | 0x12af
> cqlsh> select length(key), octetlength(value) from test.test;
> system.length(key) | system.octetlength(value)
> --------------------+---------------------------
> 5 | 2
> cqlsh> INSERT INTO test.test (key, value) VALUES ('こんにちは世界', 0x12AF);
> cqlsh> select length(key), octetlength(key), octetlength(value) from
> test.test where key = 'こんにちは世界';
> system.length(key) | system.octetlength(key) | system.octetlength(value)
> --------------------+-------------------------+---------------------------
> 7 | 21 | 2
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]