Re: Schema Evolution for nested Dataset[T]

2017-05-02 Thread Michael Armbrust
Unfortunately there is not an easy way to add nested columns (though I do think we should implement the API you attempted to use). You'll have to build the struct manually: allData.withColumn("student", struct($"student.name", coalesce($"student.age", lit(0)) as 'age)) You could automate the cons
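The one-liner above is cut off by the digest, so here is a fuller sketch of the same idea: rebuild the `student` struct column by hand, defaulting a missing `age` to 0. The `allData` name comes from the thread; the `SparkSession` setup is an assumption for a runnable context.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Rebuild the nested `student` struct field by field, substituting 0
// whenever the stored `age` is null (i.e. written before the field existed).
val patched = allData.withColumn(
  "student",
  struct(
    $"student.name",
    coalesce($"student.age", lit(0)) as "age"
  )
)
```

Note that `withColumn` replaces the whole `student` column, so every field of the struct has to be listed again; this is the manual construction Michael refers to, and the part one "could automate" by walking the schema programmatically.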

Re: Schema Evolution for nested Dataset[T]

2017-05-02 Thread Mike Wheeler
Hi Michael, Thank you for the suggestions. I am wondering how I can make `withColumn` handle a nested structure? For example, below is my code to generate the data. I basically add the `age` field to `Person2`, which is nested in an Array for `Course2`. Then I want to fill in 0 for age when age is
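The code Mike refers to is truncated in the digest. Based on his description, the evolved schema plausibly looks like the following; the exact field layout of `Person2` and `Course2` is a reconstruction, not the original code.

```scala
// Hypothetical reconstruction of the evolved schema described above:
// `age` is the newly added field, so it must be Option[Int] to stay nullable.
case class Person2(name: String, age: Option[Int])

// Person2 is nested inside an Array, which is what makes a plain
// withColumn(struct(...)) rewrite insufficient here.
case class Course2(id: Int, students: Array[Person2])
```

Because the struct sits inside an array, the `withColumn`/`struct` trick from the previous reply does not reach the elements directly; each array element has to be rewritten (e.g. via the typed `Dataset.map`, or in later Spark versions an array `transform`).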

Re: Schema Evolution for nested Dataset[T]

2017-05-01 Thread Michael Armbrust
Oh, and if you want a default other than null: import org.apache.spark.sql.functions._ df.withColumn("address", coalesce($"address", lit()) On Mon, May 1, 2017 at 10:29 AM, Michael Armbrust wrote: > The following should work: > > val schema = implicitly[org.apache.spark.sql.Encoder[Course]].sch
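The argument to `lit()` was lost in the digest truncation, so the default value Michael suggested is unknown. As an illustration only, with an empty string standing in as the example default for a string `address` column:

```scala
import org.apache.spark.sql.functions._

// Replace a null `address` with an explicit default instead of leaving it
// null; the empty string here is an example value, not the original one.
val withDefault = df.withColumn("address", coalesce($"address", lit("")))
```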

Re: Schema Evolution for nested Dataset[T]

2017-05-01 Thread Michael Armbrust
The following should work: val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema spark.read.schema(schema).parquet("data.parquet").as[Course] Note this will only work for nullable fields (i.e. if you add a primitive like Int you need to make it an Option[Int]) On Sun, Apr 30, 2017
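Spelled out, the suggestion is to derive the schema of the *new* case class from its encoder and force it onto the old Parquet files, so fields missing on disk come back as null. A self-contained sketch (the `SparkSession` setup is assumed; `Course` is the evolved case class from the thread):

```scala
import org.apache.spark.sql.{Encoder, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Summon the schema Spark derives for the new Course definition...
val schema = implicitly[Encoder[Course]].schema

// ...and apply it when reading files written under the old definition.
// Fields absent in the old files are read as null, which is why a newly
// added primitive must be declared Option[Int] rather than Int.
val courses = spark.read.schema(schema).parquet("data.parquet").as[Course]
```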

Schema Evolution for nested Dataset[T]

2017-04-30 Thread Mike Wheeler
Hi Spark Users, Suppose I have some data (stored in parquet for example) generated as below: package com.company.entity.old case class Course(id: Int, students: List[Student]) case class Student(name: String) Then usually I can access the data by spark.read.parquet("data.parquet").as[Course] N
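The original message is cut off mid-sentence. A self-contained version of the setup it describes follows; the sample data and the write step are assumptions added so the snippet stands alone.

```scala
import org.apache.spark.sql.SparkSession

// The "old" schema exactly as given in the message.
case class Student(name: String)
case class Course(id: Int, students: List[Student])

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Write some sample data under the old schema (illustrative only)...
Seq(Course(1, List(Student("alice")))).toDS().write.parquet("data.parquet")

// ...then read it back as a typed Dataset, as the message describes.
val courses = spark.read.parquet("data.parquet").as[Course]
```

The schema-evolution problem in the rest of the thread arises when a newer version of these case classes (with extra fields) must read files written under this older definition.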