Hi All,
Below is a small piece of code run in the Scala and Python REPLs in Apache Spark. However, I am getting different output from toDebugString in the two languages. I am using the Cloudera QuickStart VM.

PYTHON:

rdd2 = sc.textFile('file:/home/training/training_materials/data/frostroad.txt').map(lambda x: x.upper()).filter(lambda x: 'THE' in x)
print rdd2.toDebugString()

(1) PythonRDD[56] at RDD at PythonRDD.scala:42 []
 |  file:/home/training/training_materials/data/frostroad.txt MapPartitionsRDD[55] at textFile at NativeMethodAccessorImpl.java:-2 []
 |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[54] at textFile at ......

SCALA:

val rdd2 = sc.textFile("file:/home/training/training_materials/data/frostroad.txt").map(x => x.toUpperCase()).filter(x => x.contains("THE"))
rdd2.toDebugString

res1: String =
(1) MapPartitionsRDD[3] at filter at <console>:21 []
 |  MapPartitionsRDD[2] at map at <console>:21 []
 |  file:/home/training/training_materials/data/frostroad.txt MapPartitionsRDD[1] at textFile at <console>:21 []
 |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[0] at textFile at <

Also, one of the Cloudera slides says that the default number of partitions is 2, but it is 1 (looking at the output of toDebugString).

I'd appreciate any help.

Thanks,
Deepak Sharma