cloud-fan commented on code in PR #55892:
URL: https://github.com/apache/spark/pull/55892#discussion_r3264153913
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3260,9 +3260,11 @@ class Analyzer(
// The star will be expanded differently if we insert `Generate` under
`Project` too early.
case p @ Project(projectList, child) if
!projectList.exists(_.exists(_.isInstanceOf[Star])) =>
val (resolvedGenerator, newProjectList) = projectList
- .map(trimNonTopLevelAliases)
.foldLeft((None: Option[Generate], Nil: Seq[NamedExpression])) {
(res, e) =>
- e match {
+ // SPARK-48091: Only trim aliases on the generator expression
itself. Trimming
+ // non-generator expressions strips aliases inside lambda
functions (e.g.,
+ // struct(x.as("data"))) before they can be resolved into struct
field names.
Review Comment:
nit: this reads as "we only call `trimNonTopLevelAliases` on the generator,"
but trim runs on every element -- the trimmed form is just only **kept** when
the element matches `AliasedGenerator`.
More importantly: this comment localizes a global mismatch. The real reason
`struct(x.as("data"))` loses its alias here is that `ExtractGenerator` runs
before `ResolveFunctions`, so the inner `struct` is still `UnresolvedFunction`,
and `trimAliases`' `CreateNamedStruct` special case cannot fire. The sibling
`Aggregate`-with-generator branch above (lines 3211-3253) has the same
`.map(trimNonTopLevelAliases)` upfront and the same alias-stripping bug. Could
the fix instead live in `trimAliases` itself (skip unresolved subtrees), so
both paths get fixed and the comment does not need to encode the rule order?
If you do keep the local fix, suggest tightening the wording:
```suggestion
// SPARK-48091: ExtractGenerator runs before ResolveFunctions,
so the inner
// `struct(...)` in lambda bodies is still UnresolvedFunction
here. Trimming
// descends into it and strips Alias children that
ResolveFunctions would
// otherwise consume as struct field names. Trim only for
AliasedGenerator
// detection; splice the original `e` into the new project list.
CleanupAliases
// trims at end-of-analysis, after struct is resolved.
```
##########
sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala:
##########
@@ -765,6 +765,37 @@ class GeneratorFunctionSuite extends SharedSparkSession {
Seq(Row(0, 10, 0, 10), Row(1, 20, 1, 20))
)
}
+
+ test("SPARK-48091: explode with transform should preserve struct field
aliases") {
+ val df = spark.createDataFrame(Seq((1, Array(1, 2, 3), Array(4, 5, 6))))
+ .toDF("id", "my_array", "my_array2")
+
+ // Without explode - aliases should work (baseline)
+ val good = df.select(
+ transform(col("my_array2"), x => struct(x.as("data"))).as("my_struct")
+ )
+ assert(good.schema("my_struct").dataType.asInstanceOf[types.ArrayType]
Review Comment:
nit: `ArrayType` is used three times (lines 777, 785, 795) via the
package-relative `types.ArrayType` while `StructType`/`IntegerType` are
imported explicitly on line 28. For consistency, import `ArrayType` and drop
the `types.` prefix:
```scala
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]