Tabrez Mohammed created SPARK-48868:
---------------------------------------

             Summary: Incorrect AnalysisException thrown using when() and mixed data types
                 Key: SPARK-48868
                 URL: https://issues.apache.org/jira/browse/SPARK-48868
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.4.0, 3.3.0
            Reporter: Tabrez Mohammed


Observe the following sample code, where I'm using when() to branch on the result of typeof():
{code:python}
from pyspark.sql import functions as F
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

schema = StructType(
    [
        StructField("ID", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=False),
        StructField("colors", ArrayType(StringType()), nullable=False),
    ]
)

data = [
    (1, "John", ["red", "blue", "green"]),
    (2, "Jane", ["yellow", "orange", "purple"]),
    (3, "Bob", ["black", "white"]),
    (4, "Alice", ["pink"]),
    (5, "Tom", ["brown", "gray"]),
]

# `spark` is the active SparkSession, as provided by the pyspark shell
df = spark.createDataFrame(data, schema)

col = "name"
# Trim string columns, join array columns, and null out anything else
df = df.withColumn(
    col,
    F.when(F.expr(f"typeof({col}) == 'string'"), F.trim(col))
    .when(
        F.expr(f"typeof({col}) LIKE 'array%'"),
        F.array_join(col, ","),
    )
    .otherwise(F.lit(None)),
)
{code}

Here's the exception I'm seeing:
{noformat}
pyspark.sql.utils.AnalysisException: cannot resolve 'array_join(name, ',')' due to data type mismatch: argument 1 requires array<string> type, however, 'name' is of string type.;
'Project [ID#0, CASE WHEN (typeof(name#1) = string) THEN trim(name#1, None) WHEN typeof(name#1) LIKE array% THEN array_join(name#1, ,, None) ELSE null END AS name#6, colors#2]
+- LogicalRDD [ID#0, name#1, colors#2], false
{noformat}

If I change col to "colors", I get this similar exception:
{noformat}
pyspark.sql.utils.AnalysisException: cannot resolve 'trim(colors)' due to data type mismatch: argument 1 requires string type, however, 'colors' is of array<string> type.;
'Project [ID#0, name#1, CASE WHEN (typeof(colors#2) = string) THEN trim(colors#2, None) WHEN typeof(colors#2) LIKE array% THEN array_join(colors#2, ,, None) ELSE null END AS colors#6]
+- LogicalRDD [ID#0, name#1, colors#2], false
{noformat}

It seems that every branch of the when() chain is type-checked during analysis, even a branch that can never be taken for the column's actual type. I was able to reproduce this on 3.3.0 and 3.4.0.
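
A possible workaround, sketched here under the assumption that the column's type is already known from the DataFrame schema when the query is built: choose the branch in Python from df.dtypes, so that only the one expression matching the column's type is ever analyzed. The helper name normalize_expr is hypothetical and not part of any Spark API; the dtype strings in the demo are the ones df.dtypes reports for the sample schema in this report.

```python
from typing import Iterable, Tuple


def normalize_expr(col: str, dtypes: Iterable[Tuple[str, str]]) -> str:
    """Return a Spark SQL expression string for `col`, chosen by its type.

    `dtypes` is a list of (column name, type string) pairs, the shape
    that `DataFrame.dtypes` returns. `normalize_expr` is a hypothetical
    helper for illustration, not a Spark API.
    """
    dtype = dict(dtypes)[col]
    quoted = f"`{col}`"  # backticks guard against unusual column names
    if dtype == "string":
        return f"trim({quoted})"
    if dtype.startswith("array"):
        return f"array_join({quoted}, ',')"
    # Mirrors the .otherwise(F.lit(None)) branch of the original code
    return "NULL"


if __name__ == "__main__":
    # What df.dtypes reports for the sample schema in this report
    sample_dtypes = [("ID", "int"), ("name", "string"), ("colors", "array<string>")]
    for name in ("name", "colors", "ID"):
        print(name, "->", normalize_expr(name, sample_dtypes))
```

With a live session this would be applied as df.withColumn(col, F.expr(normalize_expr(col, df.dtypes))). The branch is resolved before the plan is built, so the analyzer never sees trim() applied to an array column or array_join() applied to a string column. It only covers the case where the type is fixed per column, which is the case in the repro.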