[ https://issues.apache.org/jira/browse/SPARK-23959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729148#comment-16729148 ]
Sam Hendley commented on SPARK-23959:
-------------------------------------

I am working on upgrading a medium-sized project from Spark 2.0.2 to Spark 2.3.0 and ran into this bug in a few of my unit tests. The reproduction case included in this ticket fails in my environment. Adding a `.cache()` to `zs` seems to fix the issue, as expected; a runnable sketch of that workaround follows the quoted issue below.

The code in question appears to work in my production environment, where all of the input datasets are populated and loaded from Parquet files. In my tests I was using createDataset() calls to store intermediate results; if I check for empty input data and call .cache() on the resulting frame, my unit tests pass. Do you have any guesses as to what might be different in my environment that would make this fail? I tried changing Hadoop versions (2.6.5 and 2.8.3) and Spark versions (2.3.0 and 2.3.2) but was still able to reproduce the issue. Is there anything I can do to help you debug it?

> UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0
> -------------------------------------------------------------------------
>
>                 Key: SPARK-23959
>                 URL: https://issues.apache.org/jira/browse/SPARK-23959
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Sam De Backer
>            Priority: Major
>
> The following snippet works fine in Spark 2.2.1 but gives a rather cryptic runtime exception in Spark 2.3.0:
> {code:java}
> import sparkSession.implicits._
> import org.apache.spark.sql.functions._
> case class X(xid: Long, yid: Int)
> case class Y(yid: Int, zid: Long)
> case class Z(zid: Long, b: Boolean)
> val xs = Seq(X(1L, 10)).toDS()
> val ys = Seq(Y(10, 100L)).toDS()
> val zs = Seq.empty[Z].toDS()
> val j = xs
>   .join(ys, "yid")
>   .join(zs, Seq("zid"), "left")
>   .withColumn("BAM", when('b, "B").otherwise("NB"))
> j.show(){code}
> In Spark 2.2.1 it prints to the console:
> {noformat}
> +---+---+---+----+---+
> |zid|yid|xid|   b|BAM|
> +---+---+---+----+---+
> |100| 10|  1|null| NB|
> +---+---+---+----+---+{noformat}
> In Spark 2.3.0 it results in:
> {noformat}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'BAM
> at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
> at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
> at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
> ...{noformat}
> The culprit really seems to be the Dataset being created from an empty Seq[Z]. When you change that to something that will also result in an empty Dataset[Z], it works as in Spark 2.2.1, e.g.
> {code:java}
> val zs = Seq(Z(10L, true)).toDS().filter('zid < Long.MinValue){code}
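
For reference, here is a minimal, self-contained sketch of the workaround described in the comment above: materializing the empty Dataset with `.cache()` before the join. It reuses the case classes from the reproduction case in this ticket; the object wrapper, the local SparkSession setup, and the name Spark23959Workaround are assumptions added so the sketch can run standalone, not part of the original report.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Spark23959Workaround {
  // Case classes taken from the reproduction case in this ticket.
  case class X(xid: Long, yid: Int)
  case class Y(yid: Int, zid: Long)
  case class Z(zid: Long, b: Boolean)

  def main(args: Array[String]): Unit = {
    // A local SparkSession is assumed here for a standalone run.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-23959-workaround")
      .getOrCreate()
    import spark.implicits._

    val xs = Seq(X(1L, 10)).toDS()
    val ys = Seq(Y(10, 100L)).toDS()
    // Workaround from the comment above: cache the empty Dataset before joining.
    val zs = Seq.empty[Z].toDS().cache()

    val j = xs
      .join(ys, "yid")
      .join(zs, Seq("zid"), "left")
      .withColumn("BAM", when('b, "B").otherwise("NB"))
    j.show()

    spark.stop()
  }
}
{code}

Per the comment's report that adding `.cache()` to `zs` avoids the exception, this variant is expected to print the same one-row table shown above for Spark 2.2.1.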