Ron Serruya created SPARK-48091:
-----------------------------------

             Summary: Using `explode` together with `transform` in the same 
select statement causes aliases in the transformed column to be ignored
                 Key: SPARK-48091
                 URL: https://issues.apache.org/jira/browse/SPARK-48091
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.5.1, 3.5.0, 3.4.0
         Environment: Python 3.10, 3.12, OSX 14.4 and Databricks DBR 13.3, 
14.3, Pyspark 3.4.0, 3.5.0, 3.5.1
            Reporter: Ron Serruya


When using the `explode` function and the `transform` function in the same select statement, aliases used inside the transformed column are ignored.

This behaviour only happens when using the PySpark API, not when using the SQL API.

{code:java}
from pyspark.sql import functions as F

# Create the df
df = spark.createDataFrame([
    {"id": 1, "array1": ['a', 'b'], 'array2': [2,3,4]}
]){code}
Good case, where all aliases are used as expected:

{code:java}
df.select(
    F.transform(
        'array2',
        lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias"))
    ).alias("new_array2")
).printSchema() 

root
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true){code}
Bad case: when using `explode`, the aliases inside the transformed column are ignored; `id` is kept instead of `second_alias`, and `x_17` is used instead of `some_alias`.

{code:java}
df.select(
    F.explode("array1").alias("exploded"),
    F.transform(
        'array2',
        lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias"))
    ).alias("new_array2")
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- x_17: long (nullable = true)
 |    |    |-- id: long (nullable = true) {code}
When using the SQL API instead, it works fine:
{code:java}
spark.sql(
    """
    select explode(array1) as exploded, transform(array2, x -> struct(x as some_alias, id as second_alias)) as array2 from {df}
    """, df=df
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true) {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
