When running Spark from spark-shell, each time a variable is defined the
shell prints the type signature of that variable along with the
toString of the instance.

How can I programmatically generate the same signature, without using the
shell, from a Spark script or class (for debugging purposes)?

Example code, run again in the Spark shell further below (see bold output there):

------------------------------------------------------------------------------------
code:
------------------------------------------------------------------------------------
val data = Array("one", "two", "three", "two", "three", "three")
val dataRdd = sc.parallelize(data)
val dataTupleRdd =  dataRdd.map(word => (word, 1))
val countsRdd = dataTupleRdd.reduceByKey(_ + _)
countsRdd.foreach(println)


------------------------------------------------------------------------------------
code run in spark shell (see bold output below: I want to generate that
from the API)
------------------------------------------------------------------------------------


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val data = Array("one", "two", "three", "two", "three", "three")
*data: Array[String]* = Array(one, two, three, two, three, three)

scala> val dataRdd = sc.parallelize(data)
*dataRdd: org.apache.spark.rdd.RDD[String]* = ParallelCollectionRDD[0] at
parallelize at <console>:26

scala> val dataTupleRdd =  dataRdd.map(word => (word, 1))
*dataTupleRdd: org.apache.spark.rdd.RDD[(String, Int)]* =
MapPartitionsRDD[1] at map at <console>:28

scala> val countsRdd = dataTupleRdd.reduceByKey(_ + _)
*countsRdd: org.apache.spark.rdd.RDD[(String, Int)]* = ShuffledRDD[2] at
reduceByKey at <console>:30

scala> countsRdd.foreach(println)
(two,2)
(one,1)
(three,3)
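For context, one approach I have sketched (using scala.reflect's TypeTag;
`describe` is my own helper name, and the value part is plain toString
rather than the shell's pretty-printing, so it is only an approximation):

```scala
import scala.reflect.runtime.universe._

// Hypothetical helper: prints a shell-style "name: Type = value" line
// for any value whose static type the compiler can capture in a TypeTag.
def describe[T: TypeTag](name: String, value: T): Unit =
  println(s"$name: ${typeOf[T]} = $value")

val data = Array("one", "two", "three", "two", "three", "three")
describe("data", data)
// Prints the type as Array[String], but the value comes out as
// toString (e.g. [Ljava.lang.String;@...), not Array(one, two, ...).

// The same should work for RDDs, since the element type is carried
// in the TypeTag, e.g.:
//   describe("dataRdd", sc.parallelize(data))
```

The remaining gap versus the shell is the pretty-printed value; the REPL
uses its own mechanism for that, which I have not found a public API for.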
