Schema Evolution for nested Dataset[T]

Mike Wheeler Sun, 30 Apr 2017 21:12:38 -0700

Hi Spark Users,

Suppose I have some data (stored in parquet for example) generated as below:


package com.company.entity.old
case class Course(id: Int, students: List[Student])
case class Student(name: String)

I can then usually read the data as a typed Dataset with:

spark.read.parquet("data.parquet").as[Course]

Now I want to add a new field `address` to Student:

// `new` is a reserved word in Scala, so it has to be backquoted here
package com.company.entity.`new`
case class Course(id: Int, students: List[Student])
case class Student(name: String, address: String)

Running `spark.read.parquet("data.parquet").as[Course]` with the new
entity on data generated by the old entity/schema then fails, because
the stored `students` elements have no `address` field.

In this case, what is the best practice for reading data generated with
the old entity/schema into the new entity/schema, with the missing field
set to some default value? I know I can manually write a function that
transforms the old type into the new one, but that is tedious. Is there
an automatic way to do this?
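
For concreteness, the manual transformation I have in mind looks roughly
like the sketch below. The object names OldEntity, NewEntity and Upgrade
and the empty-string default are made up for illustration (they stand in
for the two packages above); this only shows the hand-written mapping I
would like to avoid, not a recommended solution.

```scala
// Hypothetical stand-in for com.company.entity.old
object OldEntity {
  case class Student(name: String)
  case class Course(id: Int, students: List[Student])
}

// Hypothetical stand-in for the new package, with the extra field
object NewEntity {
  case class Student(name: String, address: String)
  case class Course(id: Int, students: List[Student])
}

object Upgrade {
  // Assumed default for the missing field; any sentinel value would do
  val DefaultAddress: String = ""

  // Copy the old fields and fill in the default for `address`
  def student(s: OldEntity.Student): NewEntity.Student =
    NewEntity.Student(s.name, DefaultAddress)

  // The nested list has to be converted element by element
  def course(c: OldEntity.Course): NewEntity.Course =
    NewEntity.Course(c.id, c.students.map(student))
}
```

With `import spark.implicits._` in scope, I would presumably apply it as
`spark.read.parquet("data.parquet").as[OldEntity.Course].map(Upgrade.course)`.
Every nested type needs its own conversion function like this, which is
why it gets tedious as the schema grows.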

Thanks,

Mike
