Dejiu Lu created SPARK-49968:
--------------------------------

             Summary: The split function produces incorrect results with an empty regex and a limit
                 Key: SPARK-49968
                 URL: https://issues.apache.org/jira/browse/SPARK-49968
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.1
            Reporter: Dejiu Lu

The current behavior of the split function is as follows:
{code:java}
select split('hello', 'h', 1)      // result is ["hello"]
select split('hello', '-', 1)      // result is ["hello"]
select split('hello', '', 1)       // result is ["h"]

select split('1A2A3A4', 'A', 3)    // result is ["1","2","3A4"]
select split('1A2A3A4', '', 3)     // result is ["1","A","2"]{code}

However, according to the function's description, when the limit is greater than zero, the last element of the split result should contain the remaining part of the input string.
{code:java}
Arguments:
  * str - a string expression to split.
  * regex - a string representing a regular expression. The regex string should be a Java regular expression.
  * limit - an integer expression which controls the number of times the regex is applied.
      * limit > 0: The resulting array's length will not be more than `limit`, and the resulting array's last entry will contain all input beyond the last matched regex.
      * limit <= 0: `regex` will be applied as many times as possible, and the resulting array can be of any size.
{code}

So the split function produces incorrect results when given an empty regex and a positive limit: the last element is truncated to a single character instead of holding the rest of the input. The correct results should be:
{code:java}
select split('hello', '', 1)       // result is ["hello"]
select split('1A2A3A4', '', 3)     // result is ["1","A","2A3A4"]{code}
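
For comparison, the limit semantics quoted from the function description can be checked against plain Java, since the description says the regex should be a Java regular expression. The sketch below is not Spark code: `SplitLimitSketch` and `splitWithLimit` are hypothetical names used only for illustration, and the helper simply delegates to `java.lang.String.split(String, int)`. It assumes Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string.

```java
import java.util.Arrays;

public class SplitLimitSketch {

    // Hypothetical helper (not part of Spark): delegates to String.split,
    // which keeps the unsplit remainder in the last entry when limit > 0.
    static String[] splitWithLimit(String str, String regex, int limit) {
        return str.split(regex, limit);
    }

    public static void main(String[] args) {
        // Empty regex, limit 1: the whole input stays in the single entry.
        System.out.println(Arrays.toString(splitWithLimit("hello", "", 1)));   // [hello]

        // Empty regex, limit 3: two single characters, then the remainder.
        System.out.println(Arrays.toString(splitWithLimit("1A2A3A4", "", 3))); // [1, A, 2A3A4]

        // Non-empty regex, limit 3: same shape as the Spark result reported above.
        System.out.println(Arrays.toString(splitWithLimit("1A2A3A4", "A", 3))); // [1, 2, 3A4]
    }
}
```

With an empty regex, plain Java returns the results this report lists as correct, which suggests the discrepancy is specific to how Spark handles the empty-pattern case.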