[ https://issues.apache.org/jira/browse/SPARK-48091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ron Serruya updated SPARK-48091:
--------------------------------
    Description: 
When using the `explode` and `transform` functions in the same select statement, aliases used inside the transformed column are ignored. This behaviour only happens when using the PySpark API, and not when using the SQL API.
{code:java}
from pyspark.sql import functions as F

# Create the df
df = spark.createDataFrame([
    {"id": 1, "array1": ['a', 'b'], 'array2': [2, 3, 4]}
]){code}
Good case, where all aliases are used:
{code:java}
df.select(
    F.transform(
        'array2',
        lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias"))
    ).alias("new_array2")
).printSchema()

root
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true){code}
Bad case: when using explode, the aliases inside the transformed column are ignored; `id` is kept instead of `second_alias`, and `x_17` is used instead of `some_alias`:
{code:java}
df.select(
    F.explode("array1").alias("exploded"),
    F.transform(
        'array2',
        lambda x: F.struct(x.alias("some_alias"), F.col("id").alias("second_alias"))
    ).alias("new_array2")
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- new_array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- x_17: long (nullable = true)
 |    |    |-- id: long (nullable = true){code}
When using the SQL API instead, it works fine:
{code:java}
spark.sql(
    """
    select explode(array1) as exploded,
           transform(array2, x -> struct(x as some_alias, id as second_alias)) as array2
    from {df}
    """,
    df=df
).printSchema()

root
 |-- exploded: string (nullable = true)
 |-- array2: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- some_alias: long (nullable = true)
 |    |    |-- second_alias: long (nullable = true){code}
Workaround: for now, `F.named_struct` can be used instead of `F.struct`.

> Using `explode` together with `transform` in the same select statement causes
> aliases in the transformed column to be ignored
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: 
SPARK-48091
>                 URL: https://issues.apache.org/jira/browse/SPARK-48091
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.0, 3.5.0, 3.5.1
>         Environment: Python 3.10, 3.12, OSX 14.4 and Databricks DBR 13.3, 14.3, PySpark 3.4.0, 3.5.0, 3.5.1
>            Reporter: Ron Serruya
>            Priority: Minor
>              Labels: PySpark, alias
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org