Daniel Solow created SPARK-34794:
------------------------------------

             Summary: Nested higher-order functions broken in DSL
                 Key: SPARK-34794
                 URL: https://issues.apache.org/jira/browse/SPARK-34794
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.1
         Environment: 3.1.1
            Reporter: Daniel Solow
In Spark 3, if I have:
{code:java}
val df = Seq(
  (Seq(1, 2, 3), Seq("a", "b", "c"))
).toDF("numbers", "letters")
{code}
and I want to take the cross product of these two arrays, I can do the following in SQL:
{code:java}
df.selectExpr("""
  FLATTEN(
    TRANSFORM(
      numbers,
      number -> TRANSFORM(
        letters,
        letter -> (number AS number, letter AS letter)
      )
    )
  ) AS zipped
""").show(false)

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
{code}
This works fine. But when I try the equivalent using the Scala DSL, the result is wrong:
{code:java}
df.select(
  f.flatten(
    f.transform(
      $"numbers",
      (number: Column) => {
        f.transform(
          $"letters",
          (letter: Column) => {
            f.struct(
              number.as("number"),
              letter.as("letter")
            )
          }
        )
      }
    )
  ).as("zipped")
).show(10, false)

+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
{code}
Note that the numbers are not included in the output.
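For reference, the cross product the SQL version correctly computes has the same semantics as a plain Scala {{flatMap}}/{{map}} over collections. A minimal sketch, no Spark required, just to pin down the expected output:
{code:java}
// Plain-Scala model of the intended cross product: pair each number
// with every letter, then flatten the nested sequences.
def crossProduct[A, B](xs: Seq[A], ys: Seq[B]): Seq[(A, B)] =
  xs.flatMap(x => ys.map(y => (x, y)))

val zipped = crossProduct(Seq(1, 2, 3), Seq("a", "b", "c"))
println(zipped)
// List((1,a), (1,b), (1,c), (2,a), (2,b), (2,c), (3,a), (3,b), (3,c))
{code}
This is the result the DSL version should produce as well: nine pairs, with each number paired with each letter.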
The explain for this second version is:
{code:java}
== Parsed Logical Plan ==
'Project [flatten(transform('numbers, lambdafunction(transform('letters, lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, false))) AS zipped#444]
+- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
   +- LocalRelation [_1#303, _2#304]

== Analyzed Logical Plan ==
zipped: array<struct<number:string,letter:string>>
Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda x#446, false)), lambda x#445, false))) AS zipped#444]
+- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
   +- LocalRelation [_1#303, _2#304]

== Optimized Logical Plan ==
LocalRelation [zipped#444]

== Physical Plan ==
LocalTableScan [zipped#444]
{code}
It looks like the lambda variable name {{x}} is hardcoded: in the analyzed plan, both struct fields resolve to the same inner variable {{lambda x#446}}. And sure enough:

https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
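The shadowing visible in the analyzed plan can be reproduced with a toy model. This is not Spark's actual implementation, just a sketch of the mechanism: if lambda variables are resolved by name and every level of nesting binds the same hardcoded name {{x}}, the inner binding shadows the outer one.
{code:java}
// Toy TRANSFORM: bind `varName` to each element on top of the enclosing
// environment, then evaluate the body in the extended environment.
def transform[A](xs: Seq[Any], varName: String, env: Map[String, Any])(
    body: Map[String, Any] => A): Seq[A] =
  xs.map(v => body(env + (varName -> v)))

val numbers = Seq(1, 2, 3)
val letters = Seq("a", "b", "c")

// Both nesting levels hardcode the name "x". The reference that was meant
// to be the outer `number` resolves to the inner "x" instead:
val broken = transform(numbers, "x", Map.empty) { outerEnv =>
  transform(letters, "x", outerEnv) { innerEnv =>
    (innerEnv("x"), innerEnv("x")) // both lookups hit the inner binding
  }
}.flatten
println(broken) // (a,a), (b,b), (c,c) three times over -- the numbers are gone

// With a unique name per nesting level, the cross product comes back:
val fixed = transform(numbers, "x_0", Map.empty) { outerEnv =>
  transform(letters, "x_1", outerEnv) { innerEnv =>
    (innerEnv("x_0"), innerEnv("x_1"))
  }
}.flatten
println(fixed) // (1,a), (1,b), ..., (3,c)
{code}
The names {{x_0}}/{{x_1}} are illustrative only; the point is simply that generating a fresh variable name per lambda, rather than reusing {{x}}, avoids the capture.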