[
https://issues.apache.org/jira/browse/SPARK-48091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007808#comment-18007808
]
André Souprayane edited comment on SPARK-48091 at 7/18/25 7:14 AM:
-------------------------------------------------------------------
I still have the issue with the master branch:
{code:scala}
scala> var df2 = df.select(array(lit(1), lit(2), lit(3)).as("my_array"),
array(lit(1), lit(2), lit(3)).as("my_array2"))
var df2: org.apache.spark.sql.DataFrame = [my_array: array<int>, my_array2:
array<int>]
scala> df2.select(
| explode($"my_array").as("exploded"),
| transform($"my_array2", x => struct(x.as("data"))).as("my_struct")
| ).printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting
-deprecation` or `:replay -deprecation`
root
|-- exploded: integer (nullable = false)
|-- my_struct: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- x_1: integer (nullable = false)
scala> spark.version
val res3: String = 4.1.0-SNAPSHOT {code}
> Using `explode` together with `transform` in the same select statement causes
> aliases in the transformed column to be ignored
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-48091
> URL: https://issues.apache.org/jira/browse/SPARK-48091
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0, 3.5.0, 3.5.1
> Environment: Scala 2.12.15, Python 3.10, 3.12, OSX 14.4 and
> Databricks DBR 13.3, 14.3, Pyspark 3.4.0, 3.5.0, 3.5.1
> Reporter: Ron Serruya
> Priority: Minor
> Labels: alias
>
> When using the `explode` function and the `transform` function in the same select
> statement, aliases used inside the transformed column are ignored.
> This behavior occurs only with the PySpark and Scala APIs, not with the SQL API.
>
> {code:python}
> from pyspark.sql import functions as F
> # Create the df
> df = spark.createDataFrame([
> {"id": 1, "array1": ['a', 'b'], 'array2': [2,3,4]}
> ]){code}
> Good case, where all aliases are used
>
> {code:python}
> df.select(
> F.transform(
> 'array2',
> lambda x: F.struct(x.alias("some_alias"),
> F.col("id").alias("second_alias"))
> ).alias("new_array2")
> ).printSchema()
> root
> |-- new_array2: array (nullable = true)
> | |-- element: struct (containsNull = false)
> | | |-- some_alias: long (nullable = true)
> | | |-- second_alias: long (nullable = true){code}
> Bad case: when using explode, the aliases inside the transformed column are
> ignored; `id` is kept instead of `second_alias`, and `x_17` is used
> instead of `some_alias`.
>
>
> {code:python}
> df.select(
> F.explode("array1").alias("exploded"),
> F.transform(
> 'array2',
> lambda x: F.struct(x.alias("some_alias"),
> F.col("id").alias("second_alias"))
> ).alias("new_array2")
> ).printSchema()
> root
> |-- exploded: string (nullable = true)
> |-- new_array2: array (nullable = true)
> | |-- element: struct (containsNull = false)
> | | |-- x_17: long (nullable = true)
> | | |-- id: long (nullable = true) {code}
>
> {code:scala}
> import org.apache.spark.sql.functions._
> var df2 = df.select(array(lit(1), lit(2), lit(3)).as("my_array"),
> array(lit(1), lit(2), lit(3)).as("my_array2"))
> df2.select(
> explode($"my_array").as("exploded"),
> transform($"my_array2", (x) => struct(x.as("data"))).as("my_struct")
> ).printSchema
> {code}
> {noformat}
> root
> |-- exploded: integer (nullable = false)
> |-- my_struct: array (nullable = false)
> | |-- element: struct (containsNull = false)
> | | |-- x_2: integer (nullable = false)
> {noformat}
>
> When using the SQL API instead, it works fine
> {code:python}
> spark.sql(
> """
> select explode(array1) as exploded, transform(array2, x-> struct(x as
> some_alias, id as second_alias)) as array2 from {df}
> """, df=df
> ).printSchema()
> root
> |-- exploded: string (nullable = true)
> |-- array2: array (nullable = true)
> | |-- element: struct (containsNull = false)
> | | |-- some_alias: long (nullable = true)
> | | |-- second_alias: long (nullable = true) {code}
>
> Workaround: for now, `F.named_struct` can be used instead of `F.struct`.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]