[ https://issues.apache.org/jira/browse/SPARK-38485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tanel Kiis updated SPARK-38485:
-------------------------------
    Description:
When adding fields to a result of a non-deterministic UDF, that returns a struct, then that UDF is executed multiple times (once per field) for each row.

In this UT df1 passes, but df2 fails with something like:
"279751724 did not equal -1023188908"

{code}
  test("SPARK-XXXXX: non-deterministic UDF should be called once when adding fields") {
    val nondeterministicUDF = udf((s: Int) => {
      val r = Random.nextInt()
      // Both values should be the same
      GroupByKey(r, r)
    }).asNondeterministic()

    val df1 = spark.range(5).select(nondeterministicUDF($"id"))
    df1.collect().foreach {
      row => assert(row.getStruct(0).getInt(0) == row.getStruct(0).getInt(1))
    }

    val df2 = spark.range(5).select(nondeterministicUDF($"id").withField("new", lit(7)))
    df2.collect().foreach {
      row => assert(row.getStruct(0).getInt(0) == row.getStruct(0).getInt(1))
    }
  }
{code}

  was:
When adding fields to a result of a non-deterministic UDF, that returns a struct, then that UDF is executed multiple times (once per field) for each row.

In this UT df1 passes, but df2 fails with something like:
"279751724 did not equal -1023188908"

{code}
  test("SPARK-XXXXX: non-deterministic UDF should be called once when adding fields") {
    val nondeterministicUDF = udf((s: Int) => {
      val r = Random.nextInt()
      // Both values should be the same
      GroupByKey(r, r)
    }).asNondeterministic()

    val df1 = spark.range(5).select(
      nondeterministicUDF($"id").as("struct"))
    df1.collect().foreach {
      row => assert(row.getStruct(0).getInt(0) == row.getStruct(0).getInt(1))
    }

    val df2 = spark.range(5).select(
      nondeterministicUDF($"id").withField("new", lit(7)).as("struct"))
    df2.collect().foreach {
      row => assert(row.getStruct(0).getInt(0) == row.getStruct(0).getInt(1))
    }
  }
{code}

> Non-deterministic UDF executed multiple times when combined with withField
> --------------------------------------------------------------------------
>
>                 Key: SPARK-38485
>                 URL: https://issues.apache.org/jira/browse/SPARK-38485
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Tanel Kiis
>            Priority: Major
>              Labels: Correctness
>
> When adding fields to a result of a non-deterministic UDF, that returns a struct, then that UDF is executed multiple times (once per field) for each row.
> In this UT df1 passes, but df2 fails with something like:
> "279751724 did not equal -1023188908"
> {code}
>   test("SPARK-XXXXX: non-deterministic UDF should be called once when adding fields") {
>     val nondeterministicUDF = udf((s: Int) => {
>       val r = Random.nextInt()
>       // Both values should be the same
>       GroupByKey(r, r)
>     }).asNondeterministic()
>     val df1 = spark.range(5).select(nondeterministicUDF($"id"))
>     df1.collect().foreach {
>       row => assert(row.getStruct(0).getInt(0) == row.getStruct(0).getInt(1))
>     }
>     val df2 = spark.range(5).select(nondeterministicUDF($"id").withField("new", lit(7)))
>     df2.collect().foreach {
>       row => assert(row.getStruct(0).getInt(0) == row.getStruct(0).getInt(1))
>     }
>   }
> {code}
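
The failure mode described in this issue (the UDF being evaluated once per struct field instead of once per row) can be illustrated without Spark. The following is only a plain-Scala sketch of the symptom, not Spark's optimizer or a fix: `Pair`, `Extended`, `calls`, `perFieldEvaluation` and `singleEvaluation` are hypothetical names introduced here, and the assumption is that adding a field rebuilds the struct from field accesses that each re-run the UDF.

```scala
import scala.util.Random

object WithFieldSketch {
  // Hypothetical stand-ins for the struct returned by the UDF and for the
  // struct after a field has been added; neither type comes from Spark.
  final case class Pair(a: Int, b: Int)
  final case class Extended(a: Int, b: Int, extra: Int)

  // Invocation counter, so duplicated evaluation is observable.
  var calls = 0

  // Mirrors the UDF in the report: both fields carry the same random value.
  def nondeterministicUdf(id: Long): Pair = {
    calls += 1
    val r = Random.nextInt()
    Pair(r, r)
  }

  // Assumed faulty shape: every field access re-runs the UDF, so `a` and `b`
  // come from two different random draws.
  def perFieldEvaluation(id: Long): Extended =
    Extended(nondeterministicUdf(id).a, nondeterministicUdf(id).b, 7)

  // Expected shape: the UDF runs once per row and its result is reused.
  def singleEvaluation(id: Long): Extended = {
    val s = nondeterministicUdf(id)
    Extended(s.a, s.b, 7)
  }

  def main(args: Array[String]): Unit = {
    calls = 0
    val once = (0L until 5L).map(singleEvaluation)
    println(s"single evaluation: $calls calls, fields equal = ${once.forall(e => e.a == e.b)}")

    calls = 0
    val perField = (0L until 5L).map(perFieldEvaluation)
    println(s"per-field evaluation: $calls calls, fields equal = ${perField.forall(e => e.a == e.b)}")
  }
}
```

With five rows, the single-evaluation path invokes the function five times and always yields equal fields, while the per-field path invokes it ten times and will almost certainly yield mismatched fields, which matches the "did not equal" assertion failure quoted above.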