Re: question on the different way of RDD to dataframe

2022-02-08 Thread frakass
I think it's better as:

    df1.map { case (w, x, y, z) => columns(w, x, y, z) }

Thanks

On 2022/2/9 12:46, Mich Talebzadeh wrote:
> scala> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
>     p(2).toString, p(3).toString.toDouble))  // map those columns
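For context, a minimal spark-shell sketch of the two styles being compared; the case class name `columns` and the sample data are placeholders, since only fragments of the thread are shown above:

// spark-shell sketch: `spark` and `sc` are the shell's built-ins
import spark.implicits._

case class columns(name: String, colour: String, grade: String, price: Double)

// a stand-in for the data behind df1 in the thread
val rdd = sc.parallelize(Seq(("apple", "red", "A", 1.5), ("orange", "orange", "B", 2.0)))
val df1 = rdd.toDF("name", "colour", "grade", "price")

// index-based style (Mich's snippet): pull each field out of the Row by position
val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString, p(3).toString.toDouble))

// pattern-match style (the suggestion above); note it needs a Dataset (or RDD) of
// tuples, because a DataFrame's Row cannot be destructured as (w, x, y, z)
val df3 = rdd.toDS().map { case (w, x, y, z) => columns(w, x, y, z) }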

Re: question on the different way of RDD to dataframe

2022-02-08 Thread frakass
I know that using a case class I can control the data types strictly.

scala> val rdd = sc.parallelize(List(("apple",1),("orange",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> rdd.toDF.printSchema
root
 |-- _1: string (nullable = true)
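A short spark-shell sketch of the case-class route being alluded to, contrasted with the bare-tuple schema above (the class name Fruit is an illustration, not from the thread):

// same data, but mapped through a case class so the DataFrame gets
// real column names and types instead of _1/_2
import spark.implicits._

case class Fruit(fruit: String, num: Int)

val rdd = sc.parallelize(List(("apple", 1), ("orange", 2)))

// bare tuple: columns come out as _1 (string) and _2 (int)
rdd.toDF().printSchema()

// via the case class: columns are named fruit/num with the declared types
val df = rdd.map { case (f, n) => Fruit(f, n) }.toDF()
df.printSchema()
// root
//  |-- fruit: string (nullable = true)
//  |-- num: integer (nullable = false)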

Re: question on the different way of RDD to dataframe

2022-02-08 Thread Mich Talebzadeh
As Sean mentioned, a Scala case class is a handy way of representing objects with names and types. For example, if you are reading a csv file with spaced column names like "counter party" etc. and you want a more compact column name like counterparty etc.

scala> val
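The rest of that message is cut off in the archive. A rough sketch of the approach being described, with the file path, column names and case class all as illustrative placeholders:

import spark.implicits._

// compact, well-typed representation of each row
case class Trade(counterparty: String, amount: Double)

// hypothetical csv whose header has spaced names like "counter party"
val raw = spark.read.option("header", "true").csv("/tmp/trades.csv")

// map each Row into the case class, renaming and retyping as we go
val trades = raw.map(r => Trade(r.getAs[String]("counter party"),
                                r.getAs[String]("amount").toDouble))
trades.printSchema()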

Re: question on the different way of RDD to dataframe

2022-02-08 Thread Sean Owen
It's just a possibly tidier way to represent objects with named, typed fields, in order to specify a DataFrame's contents.

On Tue, Feb 8, 2022 at 4:16 AM wrote:
> Hello
>
> I am converting some py code to scala.
> This works in python:
>
> >>> rdd = sc.parallelize([('apple',1),('orange',2)])
>
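A minimal sketch of that idea, building a DataFrame's contents directly from case class instances (the names here are illustrative, not from the thread):

import spark.implicits._

case class Fruit(fruit: String, num: Int)

// column names and types come straight from the class definition
val df = Seq(Fruit("apple", 1), Fruit("orange", 2)).toDF()
df.show()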

question on the different way of RDD to dataframe

2022-02-08 Thread capitnfrakass
Hello

I am converting some py code to scala.
This works in python:

>>> rdd = sc.parallelize([('apple',1),('orange',2)])
>>> rdd.toDF(['fruit','num']).show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+

And in scala:

scala> rdd.toDF("fruit","num").show()
+------+---+
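The Scala part of the message is truncated above; for completeness, a sketch of the direct Scala equivalent, assuming spark-shell where sc and the toDF implicits are already in scope:

// Scala: toDF takes the column names as varargs rather than a Python list
val rdd = sc.parallelize(List(("apple", 1), ("orange", 2)))
rdd.toDF("fruit", "num").show()
// +------+---+
// | fruit|num|
// +------+---+
// | apple|  1|
// |orange|  2|
// +------+---+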