Hi Koert,

the case classes you are talking about would still need to be serializable
(via Kryo or plain Java serialization) to be handled efficiently.

A DataFrame is not simply a collection of Rows (which are serializable by
default); it also carries a schema with a type for each column. That way
any columnar data can be represented without writing a custom case class
each time.
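
So if you want to stay in DataFrames, the udf just has to accept the struct
column as a Row and build a new value from its fields. Here is a minimal
sketch of that idea (not tested against your exact setup), assuming Spark
2.0 and the Person case class / df from your mail below:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{col, udf}

    // the struct column arrives in the udf as a Row; read its fields by name
    val bumpAge = udf { (p: Row) =>
      Person(p.getAs[String]("name"), p.getAs[Int]("age") + 1)
    }

    val df1 = df.withColumn("person", bumpAge(col("person")))
    df1.show()

Not as pretty as the typed API, but it avoids the ClassCastException, since
the function never pretends the struct is already a Person.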

If you want to manipulate a collection of case classes, why not use good
old RDDs? (Or Datasets, if you are on Spark 2.0.)
If you want to run SQL against that collection, you will need to tell your
application how to read it as a table, by converting it to a DataFrame, as
in the sketch below.
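
A minimal sketch of that route, assuming Spark 2.0 and a SparkSession
named spark (the view name "people" is just an example):

    import spark.implicits._

    case class Person(name: String, age: Int)

    val people = Seq(Person("john", 33), Person("mike", 30)).toDS()

    // typed manipulation, no Row handling needed
    val older = people.map(p => p.copy(age = p.age + 1))

    // expose it as a table only where you actually need SQL
    older.toDF().createOrReplaceTempView("people")
    spark.sql("select name, age from people where age > 31").show()

The map stays fully typed, and the temp view gives you SQL only at the
boundary where you need it.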

Regards
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Sun, Sep 25, 2016, at 23:41, Koert Kuipers wrote:
> after having gotten used to have case classes represent complex
> structures in Datasets, i am surprised to find out that when i work in
> DataFrames with udfs no such magic exists, and i have to fall back to
> manipulating Row objects, which is error prone and somewhat ugly.
> for example:
> case class Person(name: String, age: Int)
>
> val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", "id")
> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age + 1) }).apply(col("person")))
> df1.printSchema
> df1.show
> leads to:
> java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot
> be cast to Person
