Oh, and if you want a default other than null:

import org.apache.spark.sql.functions._
df.withColumn("address", coalesce($"address", lit(<default>)))
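For instance, with a concrete default (the "N/A" value and the DataFrame name df are illustrative choices, not from the thread):

import org.apache.spark.sql.functions._

// coalesce returns its first non-null argument, so rows whose
// address is null get the literal default instead.
val filled = df.withColumn("address", coalesce($"address", lit("N/A")))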
On Mon, May 1, 2017 at 10:29 AM, Michael Armbrust <mich...@databricks.com> wrote:
> The following should work:
>
> val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema
> spark.read.schema(schema).parquet("data.parquet").as[Course]
>
> Note this will only work for nullable fields (i.e. if you add a primitive
> like Int you need to make it an Option[Int])
>
> On Sun, Apr 30, 2017 at 9:12 PM, Mike Wheeler <rotationsymmetr...@gmail.com> wrote:
>
>> Hi Spark Users,
>>
>> Suppose I have some data (stored in parquet for example) generated as
>> below:
>>
>> package com.company.entity.old
>> case class Course(id: Int, students: List[Student])
>> case class Student(name: String)
>>
>> Then usually I can access the data by
>>
>> spark.read.parquet("data.parquet").as[Course]
>>
>> Now I want to add a new field `address` to Student:
>>
>> package com.company.entity.new
>> case class Course(id: Int, students: List[Student])
>> case class Student(name: String, address: String)
>>
>> Then obviously running `spark.read.parquet("data.parquet").as[Course]`
>> on data generated by the old entity/schema will fail because `address`
>> is missing.
>>
>> In this case, what is the best practice to read data generated with
>> the old entity/schema into the new entity/schema, with the missing field
>> set to some default value? I know I can manually write a function to
>> do the transformation from the old to the new. But it is kind of
>> tedious. Any automatic methods?
>>
>> Thanks,
>>
>> Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
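Putting the two replies together, a rough end-to-end sketch for a flat Student dataset (the nested List[Student] inside Course would need extra handling; the file name "students.parquet" and the "unknown" default are illustrative assumptions):

// Sketch for spark-shell, where `spark` is predefined.
import org.apache.spark.sql.functions._
import spark.implicits._

case class Student(name: String, address: String)

// Derive the new schema from the case class; reading the old parquet
// files with it yields null for the missing `address` column.
val schema = implicitly[org.apache.spark.sql.Encoder[Student]].schema

// Backfill the nulls with a default before converting to a Dataset.
val students = spark.read.schema(schema).parquet("students.parquet")
  .withColumn("address", coalesce($"address", lit("unknown")))
  .as[Student]

This works because String is nullable; as Michael notes, a new primitive field like Int would have to be declared as Option[Int] instead.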