Daniel Darabos created SPARK-23666:
--------------------------------------

Summary: Undeterministic column name with UDFs
Key: SPARK-23666
URL: https://issues.apache.org/jira/browse/SPARK-23666
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0, 2.2.0
Reporter: Daniel Darabos

When you access structure fields in Spark SQL, the auto-generated result column name includes an internal ID.

{code:java}
scala> import spark.implicits._
scala> Seq(((1, 2), 3)).toDF("a", "b").createOrReplaceTempView("x")
scala> spark.udf.register("f", (a: Int) => a)
scala> spark.sql("select f(a._1) from x").show
+---------------------+
|UDF:f(a._1 AS _1#148)|
+---------------------+
|                    1|
+---------------------+
{code}

This ID ({{#148}}) is only included for UDFs.

{code:java}
scala> spark.sql("select factorial(a._1) from x").show
+-----------------------+
|factorial(a._1 AS `_1`)|
+-----------------------+
|                      1|
+-----------------------+
{code}

The internal ID is different on every invocation. The problem this causes for us is that the schema of the SQL output is never the same:

{code:java}
scala> spark.sql("select f(a._1) from x").schema == spark.sql("select f(a._1) from x").schema
Boolean = false
{code}

We rely on similar schema checks when reloading persisted data.
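
Until the generated name is stable, one possible stopgap for the schema check is to compare column names with the internal ID stripped out. The sketch below is plain Scala with no Spark dependency. {{ColumnNames}}, {{normalize}} and {{sameNames}} are hypothetical names made up for illustration, not part of Spark, and the pattern assumes the ID always appears as {{#}} followed by digits, as in the {{#148}} above. A column whose real name contains such a sequence would be altered too.

```scala
// Hypothetical helper, not part of Spark: removes expression IDs such as "#148"
// from auto-generated column names so that schemas can be compared by name.
object ColumnNames {
  // Assumption: the internal ID is always '#' followed by one or more digits.
  private val ExprId = "#\\d+".r

  // "UDF:f(a._1 AS _1#148)" becomes "UDF:f(a._1 AS _1)".
  def normalize(name: String): String = ExprId.replaceAllIn(name, "")

  // True when both name lists are equal once the IDs are stripped.
  def sameNames(a: Seq[String], b: Seq[String]): Boolean =
    a.map(normalize) == b.map(normalize)
}
```

With two DataFrames {{df1}} and {{df2}}, this could be called as {{ColumnNames.sameNames(df1.schema.fieldNames.toSeq, df2.schema.fieldNames.toSeq)}}. It compares names only, so data types and nullability would still need a separate check. Giving the expression an explicit alias ({{select f(a._1) as v from x}}) should also avoid depending on the generated name.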