[ https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338411#comment-17338411 ]
Apache Spark commented on SPARK-34794:
--------------------------------------

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32424

> Nested higher-order functions broken in DSL
> -------------------------------------------
>
>                 Key: SPARK-34794
>                 URL: https://issues.apache.org/jira/browse/SPARK-34794
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1, 3.2.0
>         Environment: 3.1.1
>            Reporter: Daniel Solow
>            Priority: Major
>              Labels: correctness
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
>   (Seq(1, 2, 3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the following in SQL:
> {code:java}
> df.selectExpr("""
>   FLATTEN(
>     TRANSFORM(
>       numbers,
>       number -> TRANSFORM(
>         letters,
>         letter -> (number AS number, letter AS letter)
>       )
>     )
>   ) AS zipped
> """).show(false)
> +------------------------------------------------------------------------+
> |zipped                                                                  |
> +------------------------------------------------------------------------+
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> +------------------------------------------------------------------------+
> {code}
> This works fine. But when I try the equivalent using the Scala DSL, the result is wrong:
> {code:java}
> df.select(
>   f.flatten(
>     f.transform(
>       $"numbers",
>       (number: Column) => f.transform(
>         $"letters",
>         (letter: Column) => f.struct(
>           number.as("number"),
>           letter.as("letter")
>         )
>       )
>     )
>   ).as("zipped")
> ).show(10, false)
> +------------------------------------------------------------------------+
> |zipped                                                                  |
> +------------------------------------------------------------------------+
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> +------------------------------------------------------------------------+
> {code}
> Note that the numbers are not included in the output.
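For reference, the result the DSL query is meant to produce can be sketched in plain Scala collections, with no Spark required: `flatMap` plays the role of `FLATTEN` plus the outer `TRANSFORM`, and `map` the inner `TRANSFORM`. The object and value names below are illustrative only, not part of the report.

```scala
// Plain-Scala sketch of the intended cross product: each number is paired
// with every letter, matching the SQL version's output above.
object CrossProduct {
  val numbers = Seq(1, 2, 3)
  val letters = Seq("a", "b", "c")

  // flatMap mirrors FLATTEN + the outer TRANSFORM; map mirrors the inner TRANSFORM
  val zipped: Seq[(Int, String)] =
    numbers.flatMap(number => letters.map(letter => (number, letter)))
}

object CrossProductMain extends App {
  println(CrossProduct.zipped)
  // (1,a), (1,b), (1,c), (2,a), (2,b), (2,c), (3,a), (3,b), (3,c)
}
```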
> The explain for this second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>    +- LocalRelation [_1#303, _2#304]
>
> == Analyzed Logical Plan ==
> zipped: array<struct<number:string,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>    +- LocalRelation [_1#303, _2#304]
>
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
>
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> It seems the lambda variable name x is hardcoded. And sure enough:
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
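The hardcoded name explains the duplicated letters: both nested lambdas bind the same variable `x`, so the inner binding shadows the outer one, and both struct fields resolve to the letter (visible in the analyzed plan, where `lambda x#446` appears for both fields). A minimal, Spark-free sketch of that shadowing follows; the environment-based `transform` here is a toy stand-in for Catalyst's lambda resolution, not Spark API.

```scala
// Toy model of lambda resolution: a lambda variable is just a name looked
// up in an environment. Hardcoding the same name "x" for both nested
// lambdas makes the inner binding shadow the outer one.
object Shadowing {
  type Env = Map[String, Any]

  // TRANSFORM: evaluate the body once per element, with the lambda's
  // variable name bound to that element
  def transform(arr: Seq[Any], varName: String, body: Env => Any, env: Env): Seq[Any] =
    arr.map(elem => body(env + (varName -> elem)))

  val numbers = Seq(1, 2, 3)
  val letters = Seq("a", "b", "c")

  // Both lambdas use the hardcoded name "x". The struct body looks up "x"
  // twice, and both lookups hit the inner (letter) binding, so the number
  // is lost -- exactly the wrong output shown in the report.
  val zipped: Seq[Any] =
    transform(numbers, "x", outerEnv =>
      transform(letters, "x", innerEnv => (innerEnv("x"), innerEnv("x")), outerEnv),
      Map.empty
    ).flatMap(_.asInstanceOf[Seq[Any]]) // FLATTEN
}

object ShadowingMain extends App {
  println(Shadowing.zipped)
  // (a,a), (b,b), (c,c) repeated for each number -- the numbers never appear
}
```

With distinct names per lambda (which is what the linked pull request aims to make the DSL generate), the outer binding would survive and each pair would be (number, letter).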