[ 
https://issues.apache.org/jira/browse/SPARK-34830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306678#comment-17306678
 ] 

Daniel Solow commented on SPARK-34830:
--------------------------------------

I think it's the same issue as yours -- later array results overwrite earlier 
ones, but the new value keeps only as many characters as the value it 
overwrites. In the example I showed, the first result is "abc" and the second 
is "defg"; when "defg" overwrites "abc", only the first 3 characters ("def") 
survive, because "abc" is three characters long. Your example uses 
fixed-length types, so this effect isn't visible.

The overwritten strings are also null-terminated (there's actually a \x00 byte 
at the end).
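
The mechanism described above can be sketched in plain Scala (an illustration of the observed behavior only, not Spark's actual writer code): the later result is truncated to the length of the slot it overwrites, and the stored value carries a trailing null byte.

```scala
object TruncationSketch extends App {
  // The earlier array slot held "abc"; the later result is "defg".
  val earlier = "abc"
  val later = "defg"

  // Only the first `earlier.length` characters of the later value survive.
  val kept = later.take(earlier.length)
  println(kept) // def

  // The overwritten string is also null-terminated: a real \x00 byte
  // sits at the end of the stored value.
  val stored = kept + "\u0000"
  println(stored.length) // 4
}
```

This reproduces the corrupted "[def, defg]" output shown in the issue: "def" is not a lookup result at all, just "defg" clipped to three characters.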

> Some UDF calls inside transform are broken
> ------------------------------------------
>
>                 Key: SPARK-34830
>                 URL: https://issues.apache.org/jira/browse/SPARK-34830
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Daniel Solow
>            Priority: Major
>
> Let's say I want to create a UDF to do a simple lookup on a string:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> val M = Map("a" -> "abc", "b" -> "defg")
> val BM = spark.sparkContext.broadcast(M)
> val LOOKUP = f.udf((s: String) => BM.value.get(s))
> {code}
> Now if I have the following dataframe:
> {code:java}
> val df = Seq(
>     Tuple1(Seq("a", "b"))
> ).toDF("arr")
> {code}
> and I want to run this UDF over each element in the array, I can do:
> {code:java}
> df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false)
> {code}
> This should show:
> {code:java}
> +-----------+
> |arr        |
> +-----------+
> |[abc, defg]|
> +-----------+
> {code}
> However it actually shows:
> {code:java}
> +-----------+
> |arr        |
> +-----------+
> |[def, defg]|
> +-----------+
> {code}
> It's also broken in SQL (even without the DataFrame DSL). This gives the same result:
> {code:java}
> spark.udf.register("LOOKUP",(s: String) => BM.value.get(s))
> df.selectExpr("TRANSFORM(arr, a -> LOOKUP(a)) AS arr").show(false)
> {code}
> Note that "def" is not even in the map I'm using.
> This is a big problem because it breaks existing code/UDFs. I noticed this 
> because the job I ported from 2.4.5 to 3.1.1 seemed to be working, but was 
> actually producing broken data.
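
The expected semantics of the quoted example can be checked with the plain Scala map alone, no Spark involved (a sanity check, not a reproduction of the bug):

```scala
object ExpectedSemantics extends App {
  // The map and keys from the report above.
  val M = Map("a" -> "abc", "b" -> "defg")

  // Per-element lookup, as transform + LOOKUP should behave.
  val expected = Seq("a", "b").map(M.get)
  println(expected) // List(Some(abc), Some(defg))

  // "def" appears nowhere among the map's values, so the observed
  // [def, defg] output must be corruption, not a lookup result.
  println(M.values.exists(_ == "def")) // false
}
```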



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
