Hi All,

Below is a small piece of code run in the Scala and Python REPLs of Apache 
Spark. However, I am getting different output in the two languages when I 
execute toDebugString. I am using the Cloudera QuickStart VM.

PYTHON

rdd2 = sc.textFile('file:/home/training/training_materials/data/frostroad.txt').map(lambda x: x.upper()).filter(lambda x: 'THE' in x)

print rdd2.toDebugString()
(1) PythonRDD[56] at RDD at PythonRDD.scala:42 []
 |  file:/home/training/training_materials/data/frostroad.txt MapPartitionsRDD[55] at textFile at NativeMethodAccessorImpl.java:-2 []
 |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[54] at textFile at ......

SCALA

val rdd2 = sc.textFile("file:/home/training/training_materials/data/frostroad.txt").map(x => x.toUpperCase()).filter(x => x.contains("THE"))



rdd2.toDebugString
res1: String =
(1) MapPartitionsRDD[3] at filter at <console>:21 []
 |  MapPartitionsRDD[2] at map at <console>:21 []
 |  file:/home/training/training_materials/data/frostroad.txt MapPartitionsRDD[1] at textFile at <console>:21 []
 |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[0] at textFile at <
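
One possible reading of the difference, offered as an assumption rather than something confirmed on this VM: PySpark chains consecutive Python transformations such as map and filter into a single function per partition, so the JVM side sees only one PythonRDD, whereas Scala creates one MapPartitionsRDD per transformation. The sketch below is plain Python with no Spark involved; the names pipeline, map_upper and filter_the and the sample lines are made up for illustration.

```python
# Illustrative only: plain Python, no Spark required.
# All names and the sample data below are hypothetical.

def pipeline(*stages):
    """Fuse several per-partition functions into one, applied in order."""
    def fused(iterator):
        for stage in stages:
            iterator = stage(iterator)
        return iterator
    return fused

def map_upper(lines):
    # Counterpart of .map(lambda x: x.upper())
    return (line.upper() for line in lines)

def filter_the(lines):
    # Counterpart of .filter(lambda x: 'THE' in x)
    return (line for line in lines if 'THE' in line)

# One fused function stands in for the single PythonRDD stage.
fused = pipeline(map_upper, filter_the)

sample = ["the first line", "second line", "a line with the word"]
result = list(fused(sample))
```

Under this reading both REPLs would compute the same data, and only the lineage printed by toDebugString would differ.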


Also, one of the Cloudera slides says that the default number of partitions is 
2; however, it is 1 here (going by the leading (1) in the toDebugString output).
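
A note on where the 2 may come from: as far as I know, Spark derives the default minimum number of partitions for textFile as the smaller of the default parallelism and 2, so a shell running with a single core would report 1. The sketch below only mirrors that rule in plain Python; default_min_partitions is a hypothetical helper, and the single-core figure for the VM is an assumption, not something taken from the output above.

```python
# Illustrative only: mirrors the rule min(defaultParallelism, 2) in plain Python.
# default_min_partitions is a hypothetical helper, not a Spark API.

def default_min_partitions(default_parallelism):
    """Return the default minimum partition count for a given parallelism."""
    return min(default_parallelism, 2)

one_core = default_min_partitions(1)    # assumed single-core VM
four_cores = default_min_partitions(4)  # a machine with more cores
```

In the PySpark shell the actual values can be compared with sc.defaultMinPartitions and rdd2.getNumPartitions(). The real partition count also depends on how the input file is split, so a small file can still end up in one partition.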


Appreciate any help.


Thanks

Deepak Sharma
