[
https://issues.apache.org/jira/browse/CASSANDRA-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joey Lynch updated CASSANDRA-20102:
-----------------------------------
Test and Documentation Plan:
Unit and fuzz tests, will improve to quicktheories tests in next iteration.
Once I have rough consensus on the names will document in our functions docs.
Status: Patch Available (was: In Progress)
Trunk [patch|https://github.com/apache/cassandra/pull/3707] is fairly
straightforward.
I can backport to 5.0 and 4.1 if we want or happy to leave it trunk only.
> Support simple length functions for blob and string
> ---------------------------------------------------
>
> Key: CASSANDRA-20102
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20102
> Project: Cassandra
> Issue Type: Improvement
> Components: CQL/Semantics
> Reporter: Joey Lynch
> Assignee: Joey Lynch
> Priority: Low
> Fix For: 5.x
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> A common question from users is how to query for the length of {{blob}} and
> {{text}} columns, and the answer is often for users to enable UDFs and then
> write their own
> ([1|https://stackoverflow.com/questions/49674114/cassandra-query-string-length],
>
> [2|https://stackoverflow.com/questions/58401750/how-can-i-get-size-of-blob-column-of-cassandra-using-user-defined-functions],
> ...). I think it would make sense for Cassandra to support a subset of the
> SQL99 [binary length
> functions|https://sql-99.readthedocs.io/en/latest/chapters/05.html#octet-length-expression]
> and a UTF8 codepoint aware {{length}} function similar to PostgreSQL's
> [{{length}}|https://www.postgresql.org/docs/9.1/functions-string.html]
> function.
> Specifically I propose two new system functions:
> # {{octet_length}} - Defined on all types and returns the number of
> underlying bytes in the bytebuffer representation of the column. Does _not_
> account for storage overhead (e.g. does not account for metadata)
> # {{length}} - Defined on UTF8 strings, and returns the number of UTF8
> codepoints in the string, equivalent to Java's
> [{{String.length}}|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/String.html#length()]
> function.
> Patch coming shortly. Particularly interested in feedback if we A) think
> these functions are reasonable to have inside Cassandra, and B) like the
> names of the functions to align with SQL or if we want to go nonstandard and
> do like {{blength}} instead of octet length.
> An example of the difference between octet length and string length
> {code}
> // A table with text key and blob value
> cqlsh> select key, value from test.test;
> key | value
> -------+--------
> hello | 0x12af
> cqlsh> select length(key), octet_length(value) from test.test;
> system.length(key) | system.octet_length(value)
> --------------------+---------------------------
> 5 | 2
> cqlsh> INSERT INTO test.test (key, value) VALUES ('こんにちは世界', 0x12AF);
> cqlsh> select length(key), octet_length(key), octet_length(value) from
> test.test where key = 'こんにちは世界';
> system.length(key) | system.octet_length(key) | system.octet_length(value)
> --------------------+-------------------------+---------------------------
> 7 | 21 | 2
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]