The following should work:

val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema
spark.read.schema(schema).parquet("data.parquet").as[Course]
Note this will only work for nullable fields (i.e. if you add a primitive like Int you need to make it an Option[Int]).

On Sun, Apr 30, 2017 at 9:12 PM, Mike Wheeler <rotationsymmetr...@gmail.com> wrote:
> Hi Spark Users,
>
> Suppose I have some data (stored in parquet for example) generated as
> below:
>
> package com.company.entity.old
> case class Course(id: Int, students: List[Student])
> case class Student(name: String)
>
> Then usually I can access the data by
>
> spark.read.parquet("data.parquet").as[Course]
>
> Now I want to add a new field `address` to Student:
>
> package com.company.entity.new
> case class Course(id: Int, students: List[Student])
> case class Student(name: String, address: String)
>
> Then obviously running `spark.read.parquet("data.parquet").as[Course]`
> on data generated by the old entity/schema will fail because `address`
> is missing.
>
> In this case, what is the best practice to read data generated with
> the old entity/schema to the new entity/schema, with the missing field
> set to some default value? I know I can manually write a function to
> do the transformation from the old to the new. But it is kind of
> tedious. Any automatic methods?
>
> Thanks,
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
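Putting the suggestion together, here is a minimal sketch of the whole approach. It assumes a local SparkSession and that `address` has been declared as Option[String] in the new entity (per the nullability caveat above); the `withDefault` step and the "unknown" default value are illustrative additions, not part of the original suggestion:

```scala
import org.apache.spark.sql.{Encoder, SparkSession}

// New entities, mirroring the thread. `address` is Option[String] so rows
// written with the old schema (which lacks the column) decode as None.
case class Student(name: String, address: Option[String])
case class Course(id: Int, students: List[Student])

object ReadOldParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-evolution")
      .getOrCreate()
    import spark.implicits._

    // Derive the *new* schema from the Encoder and impose it on the old
    // files; columns absent from the parquet data come back as null/None.
    val schema = implicitly[Encoder[Course]].schema
    val courses = spark.read.schema(schema).parquet("data.parquet").as[Course]

    // If a concrete default is needed rather than None, fill it afterwards:
    val withDefault = courses.map { c =>
      c.copy(students = c.students.map(s =>
        s.copy(address = s.address.orElse(Some("unknown")))))
    }

    withDefault.show()
    spark.stop()
  }
}
```

Reading with an explicit schema this way avoids writing a hand-rolled old-to-new conversion function: the Encoder-derived schema does the column alignment, and only the default-filling (if wanted) is written by hand.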