Hi,

I have tried a simple test like this:

case class A(id: Long)
val sample = spark.range(0, 10).as[A]
sample.createOrReplaceTempView("sample")
val df = spark.emptyDataset[A]
val df1 = spark.sql("select * from sample").as[A]
df.union(df1)
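If the simple test passes but the real one fails, printing both schemas side by side usually pinpoints the offending column. A minimal sketch, assuming a running SparkSession named `spark` and the datasets from the test above (this will not run outside a Spark environment):

```scala
// Sketch: compare the two schemas field by field before calling union.
// Assumes a live SparkSession `spark` and case class A from the example.
val df  = spark.emptyDataset[A]
val df1 = spark.sql("select * from sample").as[A]

df.schema.printTreeString()
df1.schema.printTreeString()

// For union, name, data type, and nullability must line up by position;
// any field where the two sides differ is printed here.
df.schema.fields.zip(df1.schema.fields).foreach { case (l, r) =>
  if (l != r) println(s"mismatch: $l vs $r")
}
```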
It runs ok. And for nullability, I thought that issue had been fixed: https://issues.apache.org/jira/browse/SPARK-18058
I think you can check your Spark version and the schema of your dataset again?

Hope this helps.

Best,

> On May 9, 2017, at 04:56, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
>
> Ok, great.
> Well, I haven't provided a good example of what I'm doing. Let's assume that my case class is
>
> case class A(tons of fields, with sub classes)
>
> val df = sqlContext.sql("select * from a").as[A]
> val df2 = spark.emptyDataset[A]
> df.union(df2)
>
> This code will throw the exception.
> Is this expected? I assume that when I do as[A] it will convert the schema to the case class schema, so it shouldn't throw the exception — or will this be done lazily, when the union is being processed?
>
> 2017-05-08 17:50 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
> Yes, unfortunately. This should actually be fixed, and the union's schema should have the less restrictive of the DataFrames.
>
> On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> Hi Burak,
> By nullability, you mean that if I have exactly the same schema, but one side supports null and the other doesn't, this exception (in union dataset) will be thrown?
>
> 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
> I also want to add that generally these may be caused by the `nullability` field in the schema.
>
> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote:
> This is because RDD.union doesn't check the schema, so you won't see the problem unless you run the RDD and hit the incompatible column problem. For RDD, you may not see any error if you don't use the incompatible column.
>
> Dataset.union requires a compatible schema.
> You can print ds.schema and ds1.schema and check if they are the same.
>
> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> Hello,
> I have a very complex case class structure, with a lot of fields.
> When I try to union two datasets of this class, it doesn't work, with the following error:
>
> ds.union(ds1)
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types
>
> But when I use its rdd, the union goes right:
>
> ds.rdd.union(ds1.rdd)
> res8: org.apache.spark.rdd.RDD[
>
> Is there any reason for this to happen (besides a bug ;) )?
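The rule Burak describes — column types must match, and the union's schema should take the less restrictive nullability of the two sides — can be sketched in plain Scala. This is an illustration of the idea, not Spark's actual implementation; `Col`, `compatible`, `merge`, and `unionSchema` are made-up names, not Spark API:

```scala
// Sketch of the union rule from this thread: types must match positionally,
// while nullability merges to the less restrictive side (cf. SPARK-18058).
case class Col(name: String, dataType: String, nullable: Boolean)

// Two columns are union-compatible when their data types match; union
// resolves columns by position and keeps the left side's names.
def compatible(a: Col, b: Col): Boolean = a.dataType == b.dataType

// The merged column is nullable if either input is nullable.
def merge(a: Col, b: Col): Col = a.copy(nullable = a.nullable || b.nullable)

def unionSchema(left: Seq[Col], right: Seq[Col]): Option[Seq[Col]] =
  if (left.length == right.length &&
      left.zip(right).forall { case (l, r) => compatible(l, r) })
    Some(left.zip(right).map { case (l, r) => merge(l, r) })
  else None

// A non-nullable and a nullable bigint column merge into a nullable one.
println(unionSchema(
  Seq(Col("id", "bigint", nullable = false)),
  Seq(Col("id", "bigint", nullable = true))))

// A type mismatch (bigint vs string) makes the union incompatible.
println(unionSchema(
  Seq(Col("id", "bigint", nullable = false)),
  Seq(Col("id", "string", nullable = true))))
```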