Hi,

I have tried a simple test like this:

case class A(id: Long)
val sample = spark.range(0, 10).as[A]
sample.createOrReplaceTempView("sample")
val df = spark.emptyDataset[A]
val df1 = spark.sql("select * from sample").as[A]
df.union(df1)
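If the simple test passes but the real one fails, printing both schemas side by side usually pinpoints the offending column. A minimal sketch, assuming a running SparkSession named `spark` and the datasets from the test above (this will not run outside a Spark environment):

```scala
// Sketch: compare the two schemas field by field before calling union.
// Assumes a live SparkSession `spark` and case class A from the example.
val df  = spark.emptyDataset[A]
val df1 = spark.sql("select * from sample").as[A]

df.schema.printTreeString()
df1.schema.printTreeString()

// For union, name, data type, and nullability must line up by position;
// any field where the two sides differ is printed here.
df.schema.fields.zip(df1.schema.fields).foreach { case (l, r) =>
  if (l != r) println(s"mismatch: $l vs $r")
}
```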
It runs ok. And for nullability, I thought that issue had been fixed: https://issues.apache.org/jira/browse/SPARK-18058
I think you can check your Spark version and the schema of your dataset again?

Hope this helps.

Best,

> On May 9, 2017, at 04:56, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
>
> Ok, great.
> Well, I haven't provided a good example of what I'm doing. Let's assume that my case class is
>
> case class A(tons of fields, with sub classes)
>
> val df = sqlContext.sql("select * from a").as[A]
> val df2 = spark.emptyDataset[A]
> df.union(df2)
>
> This code will throw the exception.
> Is this expected? I assume that when I do as[A] it will convert the schema to the case class schema, so it shouldn't throw the exception — or will this be done lazily, when the union is being processed?
>
> 2017-05-08 17:50 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
> Yes, unfortunately. This should actually be fixed, and the union's schema should have the less restrictive of the DataFrames.
>
> On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> Hi Burak,
> By nullability, you mean that if I have exactly the same schema, but one side supports null and the other doesn't, this exception (in union dataset) will be thrown?
>
> 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
> I also want to add that generally these may be caused by the `nullability` field in the schema.
>
> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote:
> This is because RDD.union doesn't check the schema, so you won't see the problem unless you run the RDD and hit the incompatible column problem. For RDD, you may not see any error if you don't use the incompatible column.
>
> Dataset.union requires a compatible schema.
> You can print ds.schema and ds1.schema and check if they are the same.
>
> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> Hello,
> I have a very complex case class structure, with a lot of fields.
> When I try to union two datasets of this class, it doesn't work, with the following error:
>
> ds.union(ds1)
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types
>
> But when I use its rdd, the union goes right:
>
> ds.rdd.union(ds1.rdd)
> res8: org.apache.spark.rdd.RDD[
>
> Is there any reason for this to happen (besides a bug ;) )?
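The rule Burak describes — column types must match, and the union's schema should take the less restrictive nullability of the two sides — can be sketched in plain Scala. This is an illustration of the idea, not Spark's actual implementation; `Col`, `compatible`, `merge`, and `unionSchema` are made-up names, not Spark API:

```scala
// Sketch of the union rule from this thread: types must match positionally,
// while nullability merges to the less restrictive side (cf. SPARK-18058).
case class Col(name: String, dataType: String, nullable: Boolean)

// Two columns are union-compatible when their data types match; union
// resolves columns by position and keeps the left side's names.
def compatible(a: Col, b: Col): Boolean = a.dataType == b.dataType

// The merged column is nullable if either input is nullable.
def merge(a: Col, b: Col): Col = a.copy(nullable = a.nullable || b.nullable)

def unionSchema(left: Seq[Col], right: Seq[Col]): Option[Seq[Col]] =
  if (left.length == right.length &&
      left.zip(right).forall { case (l, r) => compatible(l, r) })
    Some(left.zip(right).map { case (l, r) => merge(l, r) })
  else None

// A non-nullable and a nullable bigint column merge into a nullable one.
println(unionSchema(
  Seq(Col("id", "bigint", nullable = false)),
  Seq(Col("id", "bigint", nullable = true))))

// A type mismatch (bigint vs string) makes the union incompatible.
println(unionSchema(
  Seq(Col("id", "bigint", nullable = false)),
  Seq(Col("id", "string", nullable = true))))
```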