[ https://issues.apache.org/jira/browse/SPARK-23959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729148#comment-16729148 ]
Sam Hendley commented on SPARK-23959:
-------------------------------------

I am working on upgrading a medium-sized project from Spark 2.0.2 to Spark 2.3.0 and ran into this bug in a few of my unit tests. The reproduction case included in this ticket fails in my environment. Adding a `.cache()` to `zs` seems to fix the issue, as expected; a runnable sketch of that workaround follows the quoted issue below.

The code in question appears to work in my production environment, where all of the input datasets are populated and loaded from Parquet files. In my tests I was using createDataset() calls to store intermediate results; if I check for empty input data and call .cache() on the resulting frame, my unit tests pass. Do you have any guesses as to what might be different in my environment that would make this fail? I tried changing Hadoop versions (2.6.5 and 2.8.3) and Spark versions (2.3.0 and 2.3.2) but was still able to reproduce the issue. Is there anything I can do to help you debug it?

> UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0
> -------------------------------------------------------------------------
>
>                 Key: SPARK-23959
>                 URL: https://issues.apache.org/jira/browse/SPARK-23959
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Sam De Backer
>            Priority: Major
>
> The following snippet works fine in Spark 2.2.1 but gives a rather cryptic runtime exception in Spark 2.3.0:
> {code:java}
> import sparkSession.implicits._
> import org.apache.spark.sql.functions._
> case class X(xid: Long, yid: Int)
> case class Y(yid: Int, zid: Long)
> case class Z(zid: Long, b: Boolean)
> val xs = Seq(X(1L, 10)).toDS()
> val ys = Seq(Y(10, 100L)).toDS()
> val zs = Seq.empty[Z].toDS()
> val j = xs
>   .join(ys, "yid")
>   .join(zs, Seq("zid"), "left")
>   .withColumn("BAM", when('b, "B").otherwise("NB"))
> j.show(){code}
> In Spark 2.2.1 it prints to the console:
> {noformat}
> +---+---+---+----+---+
> |zid|yid|xid|   b|BAM|
> +---+---+---+----+---+
> |100| 10|  1|null| NB|
> +---+---+---+----+---+{noformat}
> In Spark 2.3.0 it results in:
> {noformat}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'BAM
> at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
> at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
> at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
> ...{noformat}
> The culprit really seems to be the Dataset being created from an empty Seq[Z]. When you change that to something that will also result in an empty Dataset[Z], it works as in Spark 2.2.1, e.g.
> {code:java}
> val zs = Seq(Z(10L, true)).toDS().filter('zid < Long.MinValue){code}
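
For reference, here is a minimal, self-contained sketch of the workaround described in the comment above: materializing the empty Dataset with `.cache()` before the join. It reuses the case classes from the reproduction case in this ticket; the object wrapper, the local SparkSession setup, and the name Spark23959Workaround are assumptions added so the sketch can run standalone, not part of the original report.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Spark23959Workaround {
  // Case classes taken from the reproduction case in this ticket.
  case class X(xid: Long, yid: Int)
  case class Y(yid: Int, zid: Long)
  case class Z(zid: Long, b: Boolean)

  def main(args: Array[String]): Unit = {
    // A local SparkSession is assumed here for a standalone run.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-23959-workaround")
      .getOrCreate()
    import spark.implicits._

    val xs = Seq(X(1L, 10)).toDS()
    val ys = Seq(Y(10, 100L)).toDS()
    // Workaround from the comment above: cache the empty Dataset before joining.
    val zs = Seq.empty[Z].toDS().cache()

    val j = xs
      .join(ys, "yid")
      .join(zs, Seq("zid"), "left")
      .withColumn("BAM", when('b, "B").otherwise("NB"))
    j.show()

    spark.stop()
  }
}
{code}

Per the comment's report that adding `.cache()` to `zs` avoids the exception, this variant is expected to print the same one-row table shown above for Spark 2.2.1.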