Hello
I am converting some Python code to Scala.
This works in python:
rdd = sc.parallelize([('apple',1),('orange',2)])
rdd.toDF(['fruit','num']).show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+
And in Scala:
scala> rdd.toDF("fruit","num").show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+
It's just a possibly tidier way to represent objects with named, typed
fields, in order to specify a DataFrame's contents.
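To illustrate the point, a plain case class already carries the field names and types from which Spark can derive a DataFrame schema. A minimal self-contained sketch (the `Fruit` name is just an illustration, not from the thread):

```scala
// A case class bundles named, typed fields; Spark derives
// a DataFrame schema from exactly this information.
case class Fruit(fruit: String, num: Int)

object FruitDemo {
  def main(args: Array[String]): Unit = {
    val rows = List(Fruit("apple", 1), Fruit("orange", 2))
    // Fields are accessed by name, with compile-time types
    rows.foreach(r => println(s"${r.fruit} -> ${r.num}"))
  }
}
```

In spark-shell, `sc.parallelize(rows).toDF` would then show `fruit` and `num` as the column names without passing them explicitly.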
On Tue, Feb 8, 2022 at 4:16 AM wrote:
As Sean mentioned, a Scala case class is a handy way of representing objects with names and types. For example, if you are reading a CSV file with spaced column names like "counter party", you can map them to more compact column names like counterparty.
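As a sketch of that renaming idea (the header, the `Trade` class, and `parseLine` are all made up for illustration, not from the thread): parse each CSV line into a case class whose fields carry the compact names.

```scala
// Hypothetical CSV with a spaced header: "counter party,amount".
// The case class gives the columns compact, valid Scala names.
case class Trade(counterparty: String, amount: Double)

object CsvDemo {
  def parseLine(line: String): Trade = {
    val Array(cp, amt) = line.split(",", 2)
    Trade(cp.trim, amt.trim.toDouble)
  }

  def main(args: Array[String]): Unit =
    println(parseLine("Acme Ltd, 99.5")) // prints Trade(Acme Ltd,99.5)
}
```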
scala> val location="hdfs://rhes75:9000/tmp/c
I know that by using a case class I can control the data types strictly.
scala> val rdd = sc.parallelize(List(("apple",1),("orange",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> rdd.toDF.printSchema
root
 |-- _1: string (nullable = true)
 |-- _2: integer (nullable = false)
I think it's better as:
df1.map { case(w,x,y,z) => columns(w,x,y,z) }
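The difference is that the pattern match deconstructs each tuple into named bindings instead of positional `p(0)`-style indexing. A self-contained sketch for a collection of tuples (the `Columns` class and sample rows are stand-ins for the ones in the thread):

```scala
case class Columns(w: String, x: String, y: String, z: Double)

object MapDemo {
  // Input rows as 4-tuples of strings, e.g. parsed from a file
  val rows = List(("a", "b", "c", "1.5"), ("d", "e", "f", "2.0"))

  // Pattern-matching map: each tuple element gets a name,
  // and only the last field is converted to Double
  val typed: List[Columns] =
    rows.map { case (w, x, y, z) => Columns(w, x, y, z.toDouble) }

  def main(args: Array[String]): Unit =
    typed.foreach(println)
}
```

Note this shape applies when the source is a Dataset or collection of tuples; a plain DataFrame of `Row`s still needs positional access as in the quoted snippet.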
Thanks
On 2022/2/9 12:46, Mich Talebzadeh wrote:
scala> val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString, p(3).toString.toDouble)) // map those columns