Any chance of a link to that video :) Thanks!
G

On 10 May 2017 at 09:49, Holden Karau <hol...@pigscanfly.ca> wrote:
> So this Python-side pipelining happens in a lot of places, which can make
> debugging extra challenging. Some people work around this with persist,
> which breaks the pipelining during debugging, but if you're interested in
> more general Python debugging I've got a YouTube video on the topic which
> could be a good intro (of course I'm pretty biased about that).
>
> On Wed, May 10, 2017 at 9:42 AM Pavel Klemenkov <pklemen...@gmail.com> wrote:
>
>> Thanks for the quick answer, Holden!
>>
>> Are there any other tricks with PySpark which are hard to debug using the
>> UI or toDebugString?
>>
>> On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> In PySpark the filter and then map steps are combined into a single
>>> transformation from the JVM point of view. This allows us to avoid
>>> copying the data back to Scala in between the filter and the map steps.
>>> The debugging experience is certainly much harder in PySpark, and I think
>>> it is an interesting area for those interested in contributing :)
>>>
>>> On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com> wrote:
>>>
>>>> This Scala code:
>>>>
>>>> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>>>>      |   filter(x => !x.contains("INFO")).
>>>>      |   map(x => (x.split("\t")(1), 1)).
>>>>      |   reduceByKey((x, y) => x + y)
>>>>
>>>> generated an obvious lineage:
>>>>
>>>> (2) ShuffledRDD[4] at reduceByKey at <console>:27 []
>>>>  +-(2) MapPartitionsRDD[3] at map at <console>:26 []
>>>>     |  MapPartitionsRDD[2] at filter at <console>:25 []
>>>>     |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
>>>>     |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
>>>>
>>>> But this Python code:
>>>>
>>>> logs = sc.textFile("../log.txt")\
>>>>     .filter(lambda x: 'INFO' not in x)\
>>>>     .map(lambda x: (x.split('\t')[1], 1))\
>>>>     .reduceByKey(lambda x, y: x + y)
>>>>
>>>> generated something strange which is hard to follow:
>>>>
>>>> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>>>>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>>>>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>>>>  +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>>     |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>>     |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
>>>>     |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
>>>>
>>>> Why is that? Does PySpark do some optimizations under the hood? This
>>>> debug string is really useless for debugging.
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>
>> --
>> Yours faithfully, Pavel Klemenkov.
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
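
For reference, a minimal sketch of the persist() workaround Holden describes: persisting the intermediate RDD stops PySpark from fusing the filter and map into a single PythonRDD, so they show up as separate steps in toDebugString(). The SparkContext setup, the app name, and the "log.txt" path are illustrative, not from the thread.

    from pyspark import SparkContext

    sc = SparkContext(appName="lineage-demo")  # illustrative app name

    # Keep a handle on the intermediate RDD and persist it. A persisted
    # (or checkpointed) RDD is not eligible for Python-side pipelining,
    # so the downstream map is no longer fused into the same PythonRDD.
    filtered = sc.textFile("log.txt").filter(lambda x: 'INFO' not in x)
    filtered.persist()  # defaults to StorageLevel.MEMORY_ONLY

    counts = (filtered
              .map(lambda x: (x.split('\t')[1], 1))
              .reduceByKey(lambda x, y: x + y))

    # toDebugString() returns bytes in PySpark, hence the decode().
    print(counts.toDebugString().decode())

    # Drop the cached data once you are done inspecting the lineage.
    filtered.unpersist()

Note that persist() changes the job's behavior (the filtered data is actually cached), so this is a debugging aid rather than something to leave in production code.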