Daniel Solow created SPARK-34830:
------------------------------------

             Summary: Some UDF calls inside transform are broken
                 Key: SPARK-34830
                 URL: https://issues.apache.org/jira/browse/SPARK-34830
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.1
            Reporter: Daniel Solow


Let's say I want to create a UDF to do a simple lookup on a string:

{code:java}
import org.apache.spark.sql.{functions => f}
val M = Map("a" -> "abc", "b" -> "defg")
val BM = spark.sparkContext.broadcast(M)
val LOOKUP = f.udf((s: String) => BM.value.get(s))
{code}

Now if I have the following dataframe:

{code:java}
val df = Seq(
    Tuple1(Seq("a", "b"))
).toDF("arr")
{code}

and I want to run this UDF over each element in the array, I can do:

{code:java}
df.select(f.transform($"arr", i => LOOKUP(i)).as("arr")).show(false)
{code}

This should show:

{code:java}
+-----------+
|arr        |
+-----------+
|[abc, defg]|
+-----------+
{code}
However, it actually shows:

{code:java}
+-----------+
|arr        |
+-----------+
|[def, defg]|
+-----------+
{code}

Note that "def" is not even in the map I'm using.

This is a big problem because it silently breaks existing code/UDFs. I only noticed it because a job I ported from 2.4.5 to 3.1.1 appeared to run successfully but was actually producing corrupted data.
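As a possible interim workaround (a sketch, not verified against this exact bug), the higher-order-function path can be avoided entirely by doing the per-element lookup inside a single UDF that takes the whole array; the name {{LOOKUP_ARR}} below is mine:

{code:java}
// Sketch of a workaround: map over the array inside one UDF call,
// so transform() and its per-element UDF invocation are never used.
val LOOKUP_ARR = f.udf((arr: Seq[String]) => arr.map(s => BM.value.get(s)))

df.select(LOOKUP_ARR($"arr").as("arr")).show(false)
{code}

Since the corruption appears only when the UDF is called inside {{transform}}, this form should return the expected {{[abc, defg]}}, at the cost of losing the composability of the lambda form.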



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
