Yes, unfortunately. This should actually be fixed: the union's schema should use the less restrictive nullability of the two DataFrames.
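To make that concrete, here is a minimal sketch (not from the original messages) of two frames that differ only in nullability. It assumes Spark 2.x with an existing SparkSession named `spark`; depending on the Spark version, the union below may resolve the column to nullable = true or fail with the AnalysisException quoted further down in this thread.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val nullableSchema    = StructType(Seq(StructField("id", IntegerType, nullable = true)))
val nonNullableSchema = StructType(Seq(StructField("id", IntegerType, nullable = false)))

val left  = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(1))), nullableSchema)
val right = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(2))), nonNullableSchema)

// The "less restrictive" result schema would mark id as nullable = true.
left.union(right).printSchema()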
On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> Hi Burak,
> By nullability you mean that if I have exactly the same schema, but
> one side supports null and the other doesn't, this exception (in union
> dataset) will be thrown?
>
> 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>:
>
>> I also want to add that generally these may be caused by the
>> `nullability` field in the schema.
>>
>> On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu
>> <shixi...@databricks.com> wrote:
>>
>>> This is because RDD.union doesn't check the schema, so you won't see the
>>> problem unless you run the RDD and hit the incompatible column problem. For
>>> RDD, you may not see any error if you don't use the incompatible column.
>>>
>>> Dataset.union requires a compatible schema. You can print ds.schema and
>>> ds1.schema and check if they are the same.
>>>
>>> On Mon, May 8, 2017 at 11:07 AM, Dirceu Semighini Filho
>>> <dirceu.semigh...@gmail.com> wrote:
>>>
>>>> Hello,
>>>> I have a very complex case class structure, with a lot of fields.
>>>> When I try to union two datasets of this class, it fails with
>>>> the following error:
>>>> ds.union(ds1)
>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>> Union can only be performed on tables with the compatible column types
>>>>
>>>> But when I use its rdd, the union works fine:
>>>> ds.rdd.union(ds1.rdd)
>>>> res8: org.apache.spark.rdd.RDD[
>>>>
>>>> Is there any reason for this to happen (besides a bug ;) )
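For anyone hitting this later, here is a rough sketch of the schema check suggested above, plus one common workaround: rebuild one side against the other side's schema so the nullability flags line up before the union. This is not from the thread; it assumes Spark 2.x, and names like explainSchemaDiff and unionWithAlignedSchema are purely illustrative.

import org.apache.spark.sql.DataFrame

// Print both schemas, including the nullable flag on every field,
// and whether they are exactly equal.
def explainSchemaDiff(a: DataFrame, b: DataFrame): Unit = {
  println(a.schema.treeString)
  println(b.schema.treeString)
  println(s"schemas equal: ${a.schema == b.schema}")
}

// Workaround sketch: re-create the second DataFrame with the first one's
// schema (only valid when the schemas differ in nullability alone),
// then union as usual.
def unionWithAlignedSchema(a: DataFrame, b: DataFrame): DataFrame = {
  val bAligned = a.sparkSession.createDataFrame(b.rdd, a.schema)
  a.union(bAligned)
}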