Debugging PySpark is admittedly difficult, and I'll agree that the docs can be 
lacking at times.  PySpark docstrings are sometimes missing or incomplete.  
While I don't write Scala, I find that the Spark Scala source code and docs can 
fill in the PySpark gaps, so I can only point you there until I contribute 
improved documentation.

If you're using PyCharm, it's possible to debug and step through your 
applications.  It just requires that you set 
`os.environ['SPARK_HOME'] = <path2spark>` and operate in local mode.  I've seen 
different setups for debugging with PyCharm, but this one worked for me.  
Additionally, I kept hitting problems with the JavaGateway in PySpark not being 
able to connect; finally, dumping the Homebrew version of Spark and downloading 
the Apache-hosted version fixed my problem.  They do not appear to be the same 
once unpacked on your local machine.
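
For concreteness, a minimal sketch of that setup (the Spark path below is a 
placeholder, and the bundled py4j zip name varies by Spark version, so I glob 
for it):

import glob
import os
import sys

# Point PySpark at the locally unpacked Apache download (placeholder path).
os.environ['SPARK_HOME'] = '/path/to/spark'

# Make the bundled pyspark and py4j packages importable from inside PyCharm.
spark_python = os.path.join(os.environ['SPARK_HOME'], 'python')
py4j_zip = glob.glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
sys.path[:0] = [spark_python, py4j_zip]

from pyspark import SparkConf, SparkContext

# Local mode keeps the driver in this process, so breakpoints work.
sc = SparkContext(conf=SparkConf().setMaster('local[2]').setAppName('debug'))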

Not to oversell it: you cannot actually see the contents of RDDs in the 
debugger, and you cannot easily step into certain code that gets executed 
inside a map function (e.g. through some lambdas).

However, on the upside, you can still step through most of your code, AND pause 
to take a `collect()` or `first()` to sanity-check things along the way.

Hint: you can use PyCharm to execute the functions that get called on the 
executors.  Use the PyCharm debugger, get the RDD into your variable scope, 
take the first couple of observations (via collect or take) through the 
“Evaluate Expression” tool, and pass them through the function in the 
expression evaluator.
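
For example, while paused at a breakpoint you might evaluate something like 
this in the “Evaluate Expression” tool (assuming an RDD named logs is in scope; 
parse_line is just a stand-in for whatever function you ship to the executors):

# Stand-in for the per-record function that normally runs on the executors.
def parse_line(line):
    return (line.split('\t')[1], 1)

# Pull a few records back to the driver...
sample = logs.take(3)

# ...and run the executor-side function on them locally to check it.
[parse_line(x) for x in sample]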

Hope this helps --

Michael Mansour

--
Michael Mansour
Data Scientist
Symantec Cloud Security


From: Pavel Klemenkov <pklemen...@gmail.com>
Date: Wednesday, May 10, 2017 at 10:43 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [EXT] Re: [Spark Core]: Python and Scala generate different DAGs for 
identical code

This video, I think: https://www.youtube.com/watch?v=LQHMMCf2ZWY

On Wed, May 10, 2017 at 8:04 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
Any chance of a link to that video :)

Thanks!

G

On 10 May 2017 at 09:49, Holden Karau <hol...@pigscanfly.ca> wrote:
So this Python-side pipelining happens in a lot of places, which can make 
debugging extra challenging. Some people work around this with persist, which 
breaks the pipelining during debugging, but if you're interested in more general 
Python debugging I've got a YouTube video on the topic which could be a good 
intro (of course I'm pretty biased about that).
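
For example, a rough sketch of the persist trick (the file name and column 
index here are made up): persisting the intermediate RDD keeps the filter and 
the map as separate steps in the lineage, so toDebugString and the UI show 
where each one sits.

logs = sc.textFile('log.txt')

filtered = logs.filter(lambda x: 'INFO' not in x)
filtered.persist()  # breaks the Python-side pipelining at this point

counts = filtered.map(lambda x: (x.split('\t')[1], 1)) \
                 .reduceByKey(lambda x, y: x + y)

# The persisted RDD now shows up as its own step in the lineage and the UI.
print(counts.toDebugString())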

On Wed, May 10, 2017 at 9:42 AM Pavel Klemenkov <pklemen...@gmail.com> wrote:
Thanks for the quick answer, Holden!

Are there any other tricks with PySpark which are hard to debug using UI or 
toDebugString?

On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
In PySpark the filter and then map steps are combined into a single 
transformation from the JVM point of view. This allows us to avoid copying the 
data back to Scala in between the filter and the map steps. The debugging 
experience is certainly much harder in PySpark, and I think it's an interesting 
area for those interested in contributing :)
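
Roughly speaking (this is only an illustrative sketch, not the actual PySpark 
internals), the two Python functions get composed into a single function that 
is applied to each partition in one pass, so the data only crosses the 
JVM/Python boundary once:

# Illustrative only: a filter followed by a map, fused into one function
# that is run over each partition in a single pass.
def fused(partition):
    for line in partition:
        if 'INFO' not in line:              # the filter step
            yield (line.split('\t')[1], 1)  # the map step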

On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com> wrote:
This Scala code:
scala> val logs = sc.textFile("big_data_specialization/log.txt").
     | filter(x => !x.contains("INFO")).
     | map(x => (x.split("\t")(1), 1)).
     | reduceByKey((x, y) => x + y)

generated obvious lineage:

(2) ShuffledRDD[4] at reduceByKey at <console>:27 []
 +-(2) MapPartitionsRDD[3] at map at <console>:26 []
    |  MapPartitionsRDD[2] at filter at <console>:25 []
    |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
    |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []

But Python code:

logs = sc.textFile("../log.txt")\
         .filter(lambda x: 'INFO' not in x)\
         .map(lambda x: (x.split('\t')[1], 1))\
         .reduceByKey(lambda x, y: x + y)

generated something strange which is hard to follow:

(2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
 |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
 |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
    |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
    |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
    |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []

Why is that? Does pyspark do some optimizations under the hood? This debug
string is really useless for debugging.




--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau



--
Yours faithfully, Pavel Klemenkov.
--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau




--
Yours faithfully, Pavel Klemenkov.
