The RDD implementations behind the Python API and the Scala API differ
slightly, so the difference in the RDD lineage you printed is expected.
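
To illustrate why the Python lineage shows a single PythonRDD where Scala
shows separate map and filter steps: as far as I understand it, PySpark fuses
chained Python functions into one per-partition function before handing them
to the JVM. The snippet below is only a toy model of that idea in plain
Python, not Spark's actual code; every class and name in it (ToyRDD,
ToyPipelinedRDD, ToyPythonRDD, ToyHadoopRDD) is hypothetical, and so are the
sample lines.

```python
# Toy model only: this is NOT Spark's source code. All names are hypothetical.


class ToyOps(object):
    """Shared map/filter helpers that always return a pipelined node."""

    def map(self, f):
        return ToyPipelinedRDD(self, lambda it: (f(x) for x in it))

    def filter(self, pred):
        return ToyPipelinedRDD(self, lambda it: (x for x in it if pred(x)))


class ToyRDD(ToyOps):
    """Stands in for a JVM-side parent RDD holding one partition of lines."""

    def __init__(self, lines, name):
        self.lines = lines
        self.name = name

    def compute(self):
        return iter(self.lines)

    def lineage(self):
        return [self.name]


class ToyPipelinedRDD(ToyOps):
    """Fuses chained Python functions into a single per-partition function."""

    def __init__(self, prev, func):
        if isinstance(prev, ToyPipelinedRDD):
            # Compose with the previous Python function instead of
            # adding another node to the lineage.
            prev_func = prev.func
            self.func = lambda it: func(prev_func(it))
            self.source = prev.source
        else:
            self.func = func
            self.source = prev

    def compute(self):
        return self.func(self.source.compute())

    def lineage(self):
        # However many Python steps were chained, only one node is reported.
        return ["ToyPythonRDD"] + self.source.lineage()


lines = ["Two roads diverged", "and sorry I could not", "then took the other"]
rdd = (ToyRDD(lines, "ToyHadoopRDD")
       .map(lambda x: x.upper())
       .filter(lambda x: "THE" in x))

print(list(rdd.compute()))
print(rdd.lineage())
```

In this model the map and the filter both run, but the lineage lists only one
Python-side node above the parent, which mirrors the shape of the PySpark
output quoted below.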

On Tue, Aug 16, 2016 at 10:58 AM, DEEPAK SHARMA <deepak_dehra...@outlook.com> wrote:

> Hi All,
>
>
> Below is a small piece of code run in the Scala and Python REPLs in Apache
> Spark. However, I am getting different output in the two languages when I
> execute toDebugString. I am using the Cloudera QuickStart VM.
>
> PYTHON
>
> rdd2 = sc.textFile('file:/home/training/training_materials/data/frostroad.txt').map(lambda x: x.upper()).filter(lambda x: 'THE' in x)
>
> print rdd2.toDebugString()
> (1) PythonRDD[56] at RDD at PythonRDD.scala:42 []
>  |  file:/home/training/training_materials/data/frostroad.txt MapPartitionsRDD[55] at textFile at NativeMethodAccessorImpl.java:-2 []
>  |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[54] at textFile at ......
>
> SCALA
>
> val rdd2 = sc.textFile("file:/home/training/training_materials/data/frostroad.txt").map(x => x.toUpperCase()).filter(x => x.contains("THE"))
>
>
>
> rdd2.toDebugString
> res1: String = (1) MapPartitionsRDD[3] at filter at <console>:21 []
>  |  MapPartitionsRDD[2] at map at <console>:21 []
>  |  file:/home/training/training_materials/data/frostroad.txt MapPartitionsRDD[1] at textFile at <console>:21 []
>  |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[0] at textFile at <
>
>
> Also, one of the Cloudera slides says that the default number of partitions
> is 2; however, it is 1 (judging by the output of toDebugString).
>
>
> Appreciate any help.
>
>
> Thanks
>
> Deepak Sharma
>
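
On the partition count: to my knowledge, when textFile is called without an
explicit minPartitions, Spark uses SparkContext.defaultMinPartitions, which is
the smaller of defaultParallelism and 2. The plain-Python sketch below only
models that rule; the function name is hypothetical, and the guess that the VM
exposes a single core to Spark is an assumption, not something the output
above confirms. The final partition count also depends on how the Hadoop input
format splits the file.

```python
def default_min_partitions(default_parallelism):
    """Model of the rule: the smaller of defaultParallelism and 2.

    Hypothetical helper for illustration; it does not call Spark.
    """
    return min(default_parallelism, 2)


# With one core available to Spark (assumed here), the default drops to 1.
for cores in (1, 2, 4):
    print(cores, default_min_partitions(cores))
```

In a live PySpark shell, sc.defaultParallelism, sc.defaultMinPartitions and
rdd2.getNumPartitions() should show the actual values on the VM.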
