Nicholas Chammas created SPARK-3354: ---------------------------------------
Summary: Add LENGTH and DATALENGTH functions to Spark SQL Key: SPARK-3354 URL: https://issues.apache.org/jira/browse/SPARK-3354 Project: Spark Issue Type: Improvement Components: SQL Reporter: Nicholas Chammas It's very common when working on data sets of strings to want to know the lengths of the strings you are analyzing. Say I have some Tweets and want to find the average length of a Tweet by language. {code} SELECT language, AVG(LEN(tweet)) AS avg_length FROM tweets GROUP BY language ORDER BY avg_length DESC; {code} This is currently not possible because Spark SQL doesn't have a {{LEN()}} function. Another common function that would be useful is one that gives the size of the data item in bytes. This can be handy when moving data from Spark SQL to another system and you need to know how to size the receiving fields appropriately. *Proposal* * Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This function returns the number of characters in a string expression. * Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of {{DATALENGTH}}. This function returns the number of bytes in any expression. *Special care* must be given to the following cases: * multi-byte characters * {{NULL}} * trailing spaces *Examples* These are suggestions for how these 2 functions should work. {code} LEN('Hello') -> 5 LEN('안녕') -> 2 LEN('Hello 안녕') -> 8 LEN(NULL) -> NULL LEN('') -> 0 LEN('Bob ') -> 3 {code} In this last example with {{'Bob '}}, trailing spaces are ignored. This matches the [behavior of SQL Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could opt to include the spaces. {code} DATALEN('Hello') -> 5 DATALEN('안녕') -> 4 DATALEN('Hello 안녕') -> 16 DATALEN(NULL) -> NULL DATALEN('') -> 0 DATALEN('Bob ') -> 5 {code} Note here how mixing English and Korean characters causes every character to be interpreted as a 2 byte wide character. Dunno if this sane; this may depend on Scala or JVM details that I wouldn't know about at the moment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org