Daniel Solow created SPARK-34830:
------------------------------------

             Summary: Some UDF calls inside transform are broken
                 Key: SPARK-34830
                 URL: https://issues.apache.org/jira/browse/SPARK-34830
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.1
            Reporter: Daniel Solow
Let's say I want to create a UDF to do a simple lookup on a string:

{code:java}
import org.apache.spark.sql.{functions => f}

val M = Map("a" -> "abc", "b" -> "defg")
val BM = spark.sparkContext.broadcast(M)
val LOOKUP = f.udf((s: String) => BM.value.get(s))
{code}

Now if I have the following dataframe:

{code:java}
val df = Seq(
  Tuple1(Seq("a", "b"))
).toDF("arr")
{code}

and I want to run this UDF over each element in the array, I can do:

{code:java}
df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false)
{code}

This should show:

{code:java}
+-----------+
|arr        |
+-----------+
|[abc, defg]|
+-----------+
{code}

However, it actually shows:

{code:java}
+-----------+
|arr        |
+-----------+
|[def, defg]|
+-----------+
{code}

Note that "def" is not even in the map I'm using. This is a big problem because it silently breaks existing code/UDFs. I noticed this because the job I ported from 2.4.5 to 3.1.1 seemed to be working, but was actually producing broken data.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
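For reference, the lookup semantics the `transform` + UDF combination is expected to implement can be sketched outside Spark entirely. This is a minimal plain-Java mirror (class and method names are hypothetical, not part of Spark): each array element is looked up in the map, which is what produces the expected `[abc, defg]` rather than the corrupted `[def, defg]`.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LookupSketch {
    // Hypothetical pure-JVM mirror of the LOOKUP UDF applied per element:
    // map each key in the array through the lookup table. Missing keys
    // become null, analogous to the UDF returning None.
    static List<String> lookupAll(Map<String, String> m, List<String> arr) {
        return arr.stream().map(m::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> m = Map.of("a", "abc", "b", "defg");
        // Per-element lookup over ["a", "b"] should yield [abc, defg]
        System.out.println(lookupAll(m, List.of("a", "b")));
    }
}
```

Whatever buffer reuse happens inside Spark's higher-order-function evaluation, the observable result of the query should match this element-by-element mapping.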