Oh, and if you want a default other than null:

import org.apache.spark.sql.functions._
df.withColumn("address", coalesce($"address", lit(<default>)))
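For instance, with a concrete default (the "N/A" value and the DataFrame name df are illustrative choices, not from the thread):

import org.apache.spark.sql.functions._

// coalesce returns its first non-null argument, so rows whose
// address is null get the literal default instead.
val filled = df.withColumn("address", coalesce($"address", lit("N/A")))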
On Mon, May 1, 2017 at 10:29 AM, Michael Armbrust <mich...@databricks.com> wrote:
> The following should work:
>
> val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema
> spark.read.schema(schema).parquet("data.parquet").as[Course]
>
> Note this will only work for nullable fields (i.e. if you add a primitive
> like Int you need to make it an Option[Int])
>
> On Sun, Apr 30, 2017 at 9:12 PM, Mike Wheeler <rotationsymmetr...@gmail.com> wrote:
>
>> Hi Spark Users,
>>
>> Suppose I have some data (stored in parquet for example) generated as
>> below:
>>
>> package com.company.entity.old
>> case class Course(id: Int, students: List[Student])
>> case class Student(name: String)
>>
>> Then usually I can access the data by
>>
>> spark.read.parquet("data.parquet").as[Course]
>>
>> Now I want to add a new field `address` to Student:
>>
>> package com.company.entity.new
>> case class Course(id: Int, students: List[Student])
>> case class Student(name: String, address: String)
>>
>> Then obviously running `spark.read.parquet("data.parquet").as[Course]`
>> on data generated by the old entity/schema will fail because `address`
>> is missing.
>>
>> In this case, what is the best practice to read data generated with
>> the old entity/schema into the new entity/schema, with the missing field
>> set to some default value? I know I can manually write a function to
>> do the transformation from the old to the new. But it is kind of
>> tedious. Any automatic methods?
>>
>> Thanks,
>>
>> Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
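Putting the two replies together, a rough end-to-end sketch for a flat Student dataset (the nested List[Student] inside Course would need extra handling; the file name "students.parquet" and the "unknown" default are illustrative assumptions):

// Sketch for spark-shell, where `spark` is predefined.
import org.apache.spark.sql.functions._
import spark.implicits._

case class Student(name: String, address: String)

// Derive the new schema from the case class; reading the old parquet
// files with it yields null for the missing `address` column.
val schema = implicitly[org.apache.spark.sql.Encoder[Student]].schema

// Backfill the nulls with a default before converting to a Dataset.
val students = spark.read.schema(schema).parquet("students.parquet")
  .withColumn("address", coalesce($"address", lit("unknown")))
  .as[Student]

This works because String is nullable; as Michael notes, a new primitive field like Int would have to be declared as Option[Int] instead.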