Any chance of a link to that video :) Thanks!
G

On 10 May 2017 at 09:49, Holden Karau <hol...@pigscanfly.ca> wrote:
> So this Python-side pipelining happens in a lot of places, which can make
> debugging extra challenging. Some people work around this with persist,
> which breaks the pipelining during debugging, but if you're interested in
> more general Python debugging I've got a YouTube video on the topic which
> could be a good intro (of course I'm pretty biased about that).
>
> On Wed, May 10, 2017 at 9:42 AM Pavel Klemenkov <pklemen...@gmail.com> wrote:
>
>> Thanks for the quick answer, Holden!
>>
>> Are there any other tricks with PySpark which are hard to debug using the
>> UI or toDebugString?
>>
>> On Wed, May 10, 2017 at 7:18 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> In PySpark the filter and then map steps are combined into a single
>>> transformation from the JVM point of view. This allows us to avoid
>>> copying the data back to Scala in between the filter and the map steps.
>>> The debugging experience is certainly much harder in PySpark, and I think
>>> it is an interesting area for those interested in contributing :)
>>>
>>> On Wed, May 10, 2017 at 7:33 AM pklemenkov <pklemen...@gmail.com> wrote:
>>>
>>>> This Scala code:
>>>>
>>>> scala> val logs = sc.textFile("big_data_specialization/log.txt").
>>>>      |   filter(x => !x.contains("INFO")).
>>>>      |   map(x => (x.split("\t")(1), 1)).
>>>>      |   reduceByKey((x, y) => x + y)
>>>>
>>>> generated an obvious lineage:
>>>>
>>>> (2) ShuffledRDD[4] at reduceByKey at <console>:27 []
>>>>  +-(2) MapPartitionsRDD[3] at map at <console>:26 []
>>>>     |  MapPartitionsRDD[2] at filter at <console>:25 []
>>>>     |  big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
>>>>     |  big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
>>>>
>>>> But this Python code:
>>>>
>>>> logs = sc.textFile("../log.txt")\
>>>>     .filter(lambda x: 'INFO' not in x)\
>>>>     .map(lambda x: (x.split('\t')[1], 1))\
>>>>     .reduceByKey(lambda x, y: x + y)
>>>>
>>>> generated something strange which is hard to follow:
>>>>
>>>> (2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
>>>>  |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
>>>>  |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
>>>>  +-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>>     |  PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
>>>>     |  ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
>>>>     |  ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
>>>>
>>>> Why is that? Does PySpark do some optimizations under the hood? This
>>>> debug string is really useless for debugging.
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>
>> --
>> Yours faithfully, Pavel Klemenkov.
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
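
For reference, a minimal sketch of the persist() workaround Holden describes: persisting the intermediate RDD stops PySpark from fusing the filter and map into a single PythonRDD, so they show up as separate steps in toDebugString(). The SparkContext setup, the app name, and the "log.txt" path are illustrative, not from the thread.

    from pyspark import SparkContext

    sc = SparkContext(appName="lineage-demo")  # illustrative app name

    # Keep a handle on the intermediate RDD and persist it. A persisted
    # (or checkpointed) RDD is not eligible for Python-side pipelining,
    # so the downstream map is no longer fused into the same PythonRDD.
    filtered = sc.textFile("log.txt").filter(lambda x: 'INFO' not in x)
    filtered.persist()  # defaults to StorageLevel.MEMORY_ONLY

    counts = (filtered
              .map(lambda x: (x.split('\t')[1], 1))
              .reduceByKey(lambda x, y: x + y))

    # toDebugString() returns bytes in PySpark, hence the decode().
    print(counts.toDebugString().decode())

    # Drop the cached data once you are done inspecting the lineage.
    filtered.unpersist()

Note that persist() changes the job's behavior (the filtered data is actually cached), so this is a debugging aid rather than something to leave in production code.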