Tabrez Mohammed created SPARK-48868:
---------------------------------------

             Summary: Incorrect AnalysisException thrown using when() and mixed data types
                 Key: SPARK-48868
                 URL: https://issues.apache.org/jira/browse/SPARK-48868
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.4.0, 3.3.0
            Reporter: Tabrez Mohammed


Observe the following sample code, where I'm using when() based on typeof():
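{code:python}
# `spark` is the active SparkSession (predefined in the pyspark shell).
from pyspark.sql import functions as F
from pyspark.sql.types import (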
    ArrayType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

schema = StructType(
    [
        StructField("ID", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=False),
        StructField("colors", ArrayType(StringType()), nullable=False),
    ]
)
data = [
    (1, "John", ["red", "blue", "green"]),
    (2, "Jane", ["yellow", "orange", "purple"]),
    (3, "Bob", ["black", "white"]),
    (4, "Alice", ["pink"]),
    (5, "Tom", ["brown", "gray"]),
]
df = spark.createDataFrame(data, schema)
col = "name"
df = df.withColumn(
    col,
    F.when(F.expr(f"typeof({col}) == 'string'"), F.trim(col))
    .when(
        F.expr(f"typeof({col}) LIKE 'array%'"),
        F.array_join(col, ","),
    )
    .otherwise(F.lit(None)),
)
{code}
 
Here's the exception I'm seeing:
{noformat}
pyspark.sql.utils.AnalysisException: cannot resolve 'array_join(name, ',')' due 
to data type mismatch: argument 1 requires array<string> type, however, 'name' 
is of string type.;
'Project [ID#0, CASE WHEN (typeof(name#1) = string) THEN trim(name#1, None) 
WHEN typeof(name#1) LIKE array% THEN array_join(name#1, ,, None) ELSE null END 
AS name#6, colors#2]
+- LogicalRDD [ID#0, name#1, colors#2], false
{noformat}
 

If I change col to "colors", I get a similar exception, this time for trim():
{noformat}
pyspark.sql.utils.AnalysisException: cannot resolve 'trim(colors)' due to data 
type mismatch: argument 1 requires string type, however, 'colors' is of 
array<string> type.;
'Project [ID#0, name#1, CASE WHEN (typeof(colors#2) = string) THEN 
trim(colors#2, None) WHEN typeof(colors#2) LIKE array% THEN 
array_join(colors#2, ,, None) ELSE null END AS colors#6]
+- LogicalRDD [ID#0, name#1, colors#2], false
{noformat}
 

Spark seems to type-check every branch of the CASE WHEN expression during 
analysis, including branches that can never be taken for the column's actual 
type. I was able to reproduce this on 3.3.0 and 3.4.0.
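
A possible workaround, sketched here only as an illustration: a column's type is fixed by the DataFrame schema, so the branch can be chosen in Python before the query is built, and the analyzer only ever sees the expression that is valid for that column. The helper names normalized_sql and normalize_column are hypothetical, and the typed null in the fallback is an assumption standing in for .otherwise(F.lit(None)).

```python
def normalized_sql(col: str, type_name: str) -> str:
    """Return a Spark SQL expression string for `col`, chosen by its type.

    Hypothetical helper. `type_name` is the simple type string that Spark
    reports in `df.dtypes`, e.g. "string" or "array<string>".
    """
    quoted = f"`{col}`"
    if type_name == "string":
        return f"trim({quoted})"
    if type_name.startswith("array"):
        return f"array_join({quoted}, ',')"
    # Stands in for .otherwise(F.lit(None)); the cast is an assumption that
    # gives the null a concrete string type.
    return "CAST(NULL AS STRING)"


def normalize_column(df, col: str):
    """Replace `col` with its normalized form, building only the valid branch."""
    # Imported here so normalized_sql() stays usable without Spark installed.
    from pyspark.sql import functions as F

    # df.dtypes is a list of (column name, type string) pairs.
    type_name = dict(df.dtypes)[col]
    return df.withColumn(col, F.expr(normalized_sql(col, type_name)))
```

With the DataFrame from the sample code, this would be called as normalize_column(df, "name") or normalize_column(df, "colors"). It sidesteps the exception rather than addressing the analyzer behavior described above.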



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
