Daniel Darabos created SPARK-23666:
--------------------------------------

             Summary: Undeterministic column name with UDFs
                 Key: SPARK-23666
                 URL: https://issues.apache.org/jira/browse/SPARK-23666
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0, 2.2.0
            Reporter: Daniel Darabos


When you access struct fields in Spark SQL, the auto-generated result column 
name can include an internal ID.
{code:java}
scala> import spark.implicits._
scala> Seq(((1, 2), 3)).toDF("a", "b").createOrReplaceTempView("x")
scala> spark.udf.register("f", (a: Int) => a)
scala> spark.sql("select f(a._1) from x").show
+---------------------+
|UDF:f(a._1 AS _1#148)|
+---------------------+
|                    1|
+---------------------+
{code}
This ID ({{#148}}) is only included for UDFs. Built-in functions such as {{factorial}} do not show it:
{code:java}
scala> spark.sql("select factorial(a._1) from x").show
+-----------------------+
|factorial(a._1 AS `_1`)|
+-----------------------+
|                      1|
+-----------------------+
{code}
The internal ID is different on every invocation. The problem this causes for 
us is that the schema of the SQL output is never the same, even for two 
identical queries:
{code:java}
scala> spark.sql("select f(a._1) from x").schema ==
       spark.sql("select f(a._1) from x").schema
Boolean = false
{code}
We rely on similar schema checks when reloading persisted data.
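
For illustration, one possible stopgap on the caller's side is to strip the ID suffix from column names before comparing them. This is only a sketch: {{StableNames}}, {{normalize}} and {{sameColumns}} are hypothetical names, not part of Spark, and the pattern assumes the ID always appears as {{#}} followed by digits, as in the output above. The second ID ({{#152}}) in the usage note is made up for the example.

```scala
// Hypothetical helper, not part of Spark.
// Assumption: the internal ID always looks like '#' followed by decimal digits,
// as in "UDF:f(a._1 AS _1#148)".
object StableNames {
  private val ExprId = "#\\d+".r

  // Remove every "#<digits>" suffix from a generated column name.
  def normalize(columnName: String): String =
    ExprId.replaceAllIn(columnName, "")

  // Compare two lists of column names while ignoring the internal IDs.
  def sameColumns(a: Seq[String], b: Seq[String]): Boolean =
    a.map(normalize) == b.map(normalize)
}
```

With this, {{StableNames.sameColumns(Seq("UDF:f(a._1 AS _1#148)"), Seq("UDF:f(a._1 AS _1#152)"))}} treats the two names as equal. Note that it would also alter a legitimate column name that happens to contain {{#}} followed by digits, and it only compares names, not types or nullability.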


