cloud-fan commented on code in PR #55892:
URL: https://github.com/apache/spark/pull/55892#discussion_r3264153913


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3260,9 +3260,11 @@ class Analyzer(
       // The star will be expanded differently if we insert `Generate` under 
`Project` too early.
       case p @ Project(projectList, child) if 
!projectList.exists(_.exists(_.isInstanceOf[Star])) =>
         val (resolvedGenerator, newProjectList) = projectList
-          .map(trimNonTopLevelAliases)
           .foldLeft((None: Option[Generate], Nil: Seq[NamedExpression])) { 
(res, e) =>
-            e match {
+            // SPARK-48091: Only trim aliases on the generator expression 
itself. Trimming
+            // non-generator expressions strips aliases inside lambda 
functions (e.g.,
+            // struct(x.as("data"))) before they can be resolved into struct 
field names.

Review Comment:
   nit: this reads as "we only call `trimNonTopLevelAliases` on the generator," 
but trim runs on every element -- the trimmed form is just only **kept** when 
the element matches `AliasedGenerator`.
   
   More importantly: this comment localizes a global mismatch. The real reason 
`struct(x.as("data"))` loses its alias here is that `ExtractGenerator` runs 
before `ResolveFunctions`, so the inner `struct` is still `UnresolvedFunction`, 
and `trimAliases`' `CreateNamedStruct` special case cannot fire. The sibling 
`Aggregate`-with-generator branch above (lines 3211-3253) has the same 
`.map(trimNonTopLevelAliases)` upfront and the same alias-stripping bug. Could 
the fix instead live in `trimAliases` itself (skip unresolved subtrees), so 
both paths get fixed and the comment does not need to encode the rule order?
   
   If you do keep the local fix, suggest tightening the wording:
   
   ```suggestion
               // SPARK-48091: ExtractGenerator runs before ResolveFunctions, 
so the inner
               // `struct(...)` in lambda bodies is still UnresolvedFunction 
here. Trimming
               // descends into it and strips Alias children that 
ResolveFunctions would
               // otherwise consume as struct field names. Trim only for 
AliasedGenerator
               // detection; splice the original `e` into the new project list. 
CleanupAliases
               // trims at end-of-analysis, after struct is resolved.
   ```



##########
sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala:
##########
@@ -765,6 +765,37 @@ class GeneratorFunctionSuite extends SharedSparkSession {
       Seq(Row(0, 10, 0, 10), Row(1, 20, 1, 20))
     )
   }
+
+  test("SPARK-48091: explode with transform should preserve struct field 
aliases") {
+    val df = spark.createDataFrame(Seq((1, Array(1, 2, 3), Array(4, 5, 6))))
+      .toDF("id", "my_array", "my_array2")
+
+    // Without explode - aliases should work (baseline)
+    val good = df.select(
+      transform(col("my_array2"), x => struct(x.as("data"))).as("my_struct")
+    )
+    assert(good.schema("my_struct").dataType.asInstanceOf[types.ArrayType]

Review Comment:
   nit: `ArrayType` is used three times (lines 777, 785, 795) via the 
package-relative `types.ArrayType` while `StructType`/`IntegerType` are 
imported explicitly on line 28. For consistency, import `ArrayType` and drop 
the `types.` prefix:
   
   ```scala
   import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to