Hi Mohit,

 

I’m not sure that there is a “correct” answer here, but I tend to use classes 
whenever the input or output data represents something meaningful (such as a 
domain model object). I would recommend against creating many temporary classes, 
one for each and every transformation step, as that can become difficult to 
maintain over time.

 

Using withColumn statements will continue to work, and you don’t need to cast 
to your output class until you’ve set up all of your transformations. Therefore, 
you can do things like:

 

case class A(f1: String, f2: String, f3: String)   // field types are illustrative
case class B(f1: String, f2: String, f3: String, f4: String, f5: String, f6: String)

val ds_a = spark.read.csv("path").as[A]   // assumes the csv columns line up with A's fields
val ds_b = ds_a
  .withColumn("f4", someUdf(col("f1")))   // input columns here are just placeholders
  .withColumn("f5", someUdf(col("f2")))
  .withColumn("f6", someUdf(col("f3")))
  .as[B]
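For the snippet above to compile you also need the usual pieces in scope: 
spark.implicits._ for the .as[A] / .as[B] encoders, and someUdf defined as an 
actual UDF (or any expression returning a Column). The added columns just have 
to match B's field names and types before the final .as[B]. A minimal sketch of 
those assumed pieces, with a placeholder UDF body, could look like:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._   // encoders for case classes A and B

val someUdf = udf((s: String) => s.trim)   // placeholder; the real logic is whatever produces f4/f5/f6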

 

Kevin

 

From: Mohit Jaggi <mohitja...@gmail.com> 
Sent: Tuesday, January 15, 2019 1:31 PM
To: user <user@spark.apache.org>
Subject: dataset best practice question

 

Fellow Spark Coders,

I am trying to move from using Dataframes to Datasets for a reasonably large 
code base. Today the code looks like this:

 

df_a = read_csv
df_b = df_a.withColumn( some_transform_that_adds_more_columns )
// repeat the above several times

 

With datasets, this will require defining

 

case class A(f1, f2, f3)   // fields from the csv file
case class B(f1, f2, f3, f4)   // union of A and the new field added by some_transform_that_adds_more_columns
// repeat this 10 times

 

Is there a better way? 

 

Mohit.
