Nicholas Chammas created SPARK-3354:
---------------------------------------

             Summary: Add LENGTH and DATALENGTH functions to Spark SQL
                 Key: SPARK-3354
                 URL: https://issues.apache.org/jira/browse/SPARK-3354
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Nicholas Chammas


It's very common when working on data sets of strings to want to know the 
lengths of the strings you are analyzing. Say I have some Tweets and want to 
find the average length of a Tweet by language.

{code}
SELECT language, AVG(LEN(tweet)) AS avg_length
FROM tweets
GROUP BY language
ORDER BY avg_length DESC;
{code}

This is currently not possible because Spark SQL doesn't have a {{LEN()}} 
function.

Another common function that would be useful is one that gives the size of the 
data item in bytes. This can be handy when moving data from Spark SQL to 
another system and you need to know how to size the receiving fields 
appropriately.

*Proposal*

* Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This 
function returns the number of characters in a string expression.
* Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of {{DATALENGTH}}. 
This function returns the number of bytes in any expression.

*Special care* must be given to the following cases:
* multi-byte characters
* {{NULL}}
* trailing spaces

*Examples*
These are suggestions for how these 2 functions should work.
{code}
LEN('Hello') -> 5
LEN('안녕') -> 2
LEN('Hello 안녕') -> 8
LEN(NULL) -> NULL
LEN('') -> 0
LEN('Bob  ') -> 3
{code}

In this last example with {{'Bob  '}}, trailing spaces are ignored. This 
matches the [behavior of SQL 
Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could opt 
to include the spaces.

{code}
DATALEN('Hello') -> 5
DATALEN('안녕') -> 4
DATALEN('Hello 안녕') -> 16
DATALEN(NULL) -> NULL
DATALEN('') -> 0
DATALEN('Bob  ') -> 5
{code}

Note here how mixing English and Korean characters causes every character to be 
interpreted as a 2 byte wide character. Dunno if this sane; this may depend on 
Scala or JVM details that I wouldn't know about at the moment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to