Actually, I stumbled on this SO page:
<https://stackoverflow.com/questions/31245083/how-can-pyspark-be-called-in-debug-mode>.
While the solution is not an out-of-the-box one, it is fairly simple.
In short:
- I made sure there is only one executing task at a time by calling
repartition(1) - this made it easy to locate the one and only Spark daemon
- I set a breakpoint wherever I needed one
- To "catch" the breakpoint, I added a printout and a time.sleep(15)
right before it. The printout tells me that the daemon is up and running,
and the sleep gives me time to press a few buttons so I can attach to
the process
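Put together, the steps above look roughly like this (just a sketch - the dataset and function names are placeholders, and the attach window is a parameter here only so the snippet is easy to try outside Spark):

```python
import time

def f(partition, attach_window_s=15):
    # The printout signals that the executor daemon is up; the sleep
    # window gives you time to attach PyCharm's debugger to the daemon
    # process before execution continues.
    print("daemon is up - attach the debugger now")
    time.sleep(attach_window_s)
    out = []
    for row in partition:  # set your breakpoint anywhere in this loop
        out.append(row)
    return out

# Driver side: force a single task so there is only one daemon process
# to attach to (names are illustrative):
#   dataset.repartition(1).foreachPartition(f)
```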
It worked fairly well, and I was able to debug the executor. I did notice
two strange things: sometimes I got an odd error and the debugger didn't
actually attach; this was not deterministic.
Other times I noticed a big gap between the point where I got the
notification and attached to the process, and the point where execution
resumed and I could actually step through (by big gap I mean one
considerably bigger than the sleep period, usually about a minute).
Not perfect, but it worked most of the time.
On Wed, Mar 14, 2018 at 12:07 AM, Michael Mansour <
michael_mans...@symantec.com> wrote:
> Vitaliy,
>
>
>
> From what I understand, this is not possible to do. However, let me share
> my workaround with you.
>
>
>
> Assuming you have your debugger up and running in PyCharm, set a
> breakpoint at that line, take/collect/sample your data (you could also
> consider a glom() if it's critical that the data remain partitioned,
> followed by the take/collect), and pass it into the function directly
> (plain Python, no Spark). Use the debugger to step through on that small
> sample.
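A minimal sketch of that workaround (all names are placeholders; in real code, `dataset.rdd.glom().take(n)` would bring each sampled partition to the driver as a plain Python list):

```python
def f(partition):
    # stand-in for the partition function you want to debug
    return [x * 2 for x in partition]

# Stand-in for dataset.rdd.glom().take(2), which returns each sampled
# partition as a plain Python list.
partitions = [[1, 2], [3, 4]]

# Run f as plain Python (no Spark) so PyCharm can step through it.
results = [f(iter(p)) for p in partitions]
print(results)  # [[2, 4], [6, 8]]
```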
>
>
>
> Alternatively, you can open up the PyCharm execution module. In the
> execution module, do the same as above with the RDD and pass it into the
> function. This alleviates the need to write debugging code, etc. I find
> this approach useful and a bit faster, but it does not offer the
> step-through capability.
>
>
>
> Best of luck!
>
> M
>
> --
>
> Michael Mansour
>
> Data Scientist
>
> Symantec CASB
>
> *From: *Vitaliy Pisarev
> *Date: *Sunday, March 11, 2018 at 8:46 AM
> *To: *"user@spark.apache.org"
> *Subject: *[EXT] Debugging a local spark executor in pycharm
>
>
>
> I want to step through the work of a spark executor running locally on my
> machine, from Pycharm.
>
> I am running explicit functionality, in the form of
> dataset.foreachPartition(f) and I want to see what is going on inside f.
>
> Is there a straightforward way to do it or do I need to resort to remote
> debugging?
>
> P.S.
>
>
>
> Posted this on SO
> <https://stackoverflow.com/questions/49221733/debugging-a-local-spark-executor-in-pycharm>
> as well.
>