Max Moroz created SPARK-16324:
---------------------------------

             Summary: regexp_extract returns empty string when match fails
                 Key: SPARK-16324
                 URL: https://issues.apache.org/jira/browse/SPARK-16324
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
            Reporter: Max Moroz
            Priority: Minor

The documentation for regexp_extract isn't clear about how it should behave if the regex doesn't match the row. However, the Java documentation it refers to for further detail suggests that the return value should be null if the group wasn't matched at all, an empty string if the group actually matched the empty string, and that an exception should be raised if the entire regex didn't match. This would be identical to how Python's own re module behaves when MatchObject.group() is called.

However, in practice regexp_extract() returns an empty string when the match fails. This seems to be a bug; if it was intended as a feature, it should have been documented as such, and it was probably not a good idea, since it can result in silent bugs.

{code}
import pyspark.sql.functions as F

df = spark.createDataFrame([['abc']], ['text'])
# 'z' does not occur in 'abc', yet the result is '' rather than null or an error
assert df.select(F.regexp_extract('text', r'z', 1)).first()[0] == ''
{code}
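
For comparison, here is a minimal sketch of the three cases described above, using only Python's standard re module. The patterns are illustrative and are not taken from the report. One nuance: in Python a failed match shows up as re.search() returning None, so the exception comes from calling .group() on None rather than from the match object itself.

```python
import re

# Case 1: the regex matches, but the optional group takes no part in the match.
m = re.search(r'a(z)?', 'abc')
unmatched_group = m.group(1)   # None

# Case 2: the group takes part in the match but matches the empty string.
m = re.search(r'a(z*)', 'abc')
empty_group = m.group(1)       # ''

# Case 3: the whole regex fails to match. re.search() returns None,
# so calling .group() on the result raises AttributeError.
m = re.search(r'(z)', 'abc')
no_match_raised = False
try:
    m.group(1)
except AttributeError:
    no_match_raised = True
```

Under these semantics a failed match can always be told apart from a group that matched the empty string, which is the distinction the reported behavior of regexp_extract() loses.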