Yes, unfortunately. This should actually be fixed: the union's schema should use the less restrictive nullability of the two DataFrames.
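To make that concrete, here is a minimal sketch (not from the original messages) of two frames that differ only in nullability. It assumes Spark 2.x with an existing SparkSession named `spark`; depending on the Spark version, the union below may resolve the column to nullable = true or fail with the AnalysisException quoted further down in this thread.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val nullableSchema    = StructType(Seq(StructField("id", IntegerType, nullable = true)))
val nonNullableSchema = StructType(Seq(StructField("id", IntegerType, nullable = false)))

val left  = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(1))), nullableSchema)
val right = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(2))), nonNullableSchema)

// The "less restrictive" result schema would mark id as nullable = true.
left.union(right).printSchema()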
On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> Hi Burak,
> By nullability you mean that if I have exactly the same schema, but
> one side supports null and the other doesn't, this exception (in union
> dataset) will be thrown?
>
> 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
>
>> I also want to add that generally these may be caused by the
>> `nullability` field in the schema.
>>
>> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu
>> <shixi...@databricks.com> wrote:
>>
>>> This is because RDD.union doesn't check the schema, so you won't see the
>>> problem unless you run the RDD and hit the incompatible column problem. For
>>> RDD, you may not see any error if you don't use the incompatible column.
>>>
>>> Dataset.union requires a compatible schema. You can print ds.schema and
>>> ds1.schema and check if they are the same.
>>>
>>> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho
>>> <dirceu.semigh...@gmail.com> wrote:
>>>
>>>> Hello,
>>>> I have a very complex case class structure, with a lot of fields.
>>>> When I try to union two datasets of this class, it fails with
>>>> the following error:
>>>> ds.union(ds1)
>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>> Union can only be performed on tables with the compatible column types
>>>>
>>>> But when I use its rdd, the union works fine:
>>>> ds.rdd.union(ds1.rdd)
>>>> res8: org.apache.spark.rdd.RDD[
>>>>
>>>> Is there any reason for this to happen (besides a bug ;) )
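For anyone hitting this later, here is a rough sketch of the schema check suggested above, plus one common workaround: rebuild one side against the other side's schema so the nullability flags line up before the union. This is not from the thread; it assumes Spark 2.x, and names like explainSchemaDiff and unionWithAlignedSchema are purely illustrative.

import org.apache.spark.sql.DataFrame

// Print both schemas, including the nullable flag on every field,
// and whether they are exactly equal.
def explainSchemaDiff(a: DataFrame, b: DataFrame): Unit = {
  println(a.schema.treeString)
  println(b.schema.treeString)
  println(s"schemas equal: ${a.schema == b.schema}")
}

// Workaround sketch: re-create the second DataFrame with the first one's
// schema (only valid when the schemas differ in nullability alone),
// then union as usual.
def unionWithAlignedSchema(a: DataFrame, b: DataFrame): DataFrame = {
  val bAligned = a.sparkSession.createDataFrame(b.rdd, a.schema)
  a.union(bAligned)
}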