[ https://issues.apache.org/jira/browse/SPARK-34830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306678#comment-17306678 ]
Daniel Solow commented on SPARK-34830: -------------------------------------- I think it's the same thing as yours -- it's overwriting earlier array results with later results, but it only keeps a number of characters based on the overwritten value. So in the example I showed, the first result is "abc" and the second result is "defg" -- when it overwrites "abc" with "defg" it keeps only the first 3 characters "def" because "abc" is three characters. In your example you're using fix-length types so this effect isn't evident. The overwritten strings are also null-terminated (there's actually a \x00 byte at the end). > Some UDF calls inside transform are broken > ------------------------------------------ > > Key: SPARK-34830 > URL: https://issues.apache.org/jira/browse/SPARK-34830 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.1.1 > Reporter: Daniel Solow > Priority: Major > > Let's say I want to create a UDF to do a simple lookup on a string: > {code:java} > import org.apache.spark.sql.{functions => f} > val M = Map("a" -> "abc", "b" -> "defg") > val BM = spark.sparkContext.broadcast(M) > val LOOKUP = f.udf((s: String) => BM.value.get(s)) > {code} > Now if I have the following dataframe: > {code:java} > val df = Seq( > Tuple1(Seq("a", "b")) > ).toDF("arr") > {code} > and I want to run this UDF over each element in the array, I can do: > {code:java} > df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false) > {code} > This should show: > {code:java} > +-----------+ > |arr | > +-----------+ > |[abc, defg]| > +-----------+ > {code} > However it actually shows: > {code:java} > +-----------+ > |arr | > +-----------+ > |[def, defg]| > +-----------+ > {code} > It's also broken for SQL (even without DSL). This gives the same result: > {code:java} > spark.udf.register("LOOKUP",(s: String) => BM.value.get(s)) > df.selectExpr("TRANSFORM(arr, a -> LOOKUP(a)) AS arr").show(false) > {code} > Note that "def" is not even in the map I'm using. > This is a big problem because it breaks existing code/UDFs. I noticed this > because the job I ported from 2.4.5 to 3.1.1 seemed to be working, but was > actually producing broken data. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org