Hi Pablo,

As you probably know, Spark SQL generates custom Java code for the SQL
functions. In Scala you can use geometry.debugCodegen() (after importing
org.apache.spark.sql.execution.debug._) to print out the generated code.
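
Since you're on PySpark, the rough equivalent there (a sketch, assuming Spark
3.0 or later, where explain() accepts a mode argument) is:

>>> geometry.explain(mode="codegen")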

Shay

From: Pablo Alcain <pabloalc...@gmail.com>
Sent: Tuesday, May 3, 2022 6:07 AM
To: user@spark.apache.org
Subject: [EXTERNAL] Parse Execution Plan from PySpark

Hello all! I'm working with PySpark, trying to reproduce through streaming
processes some of the results we see in batch, just as a PoC for now. For this,
I'm thinking of interpreting the execution plan and eventually writing it back
as Python code (I'm doing something similar with pandas as well, and I'd like
both approaches to stay as similar as possible).

Let me clarify with an example: suppose that, starting from a `geometry.csv`
file with `width` and `height` columns, I want to calculate the `area` like
this:

>>> from pyspark.sql import functions as F
>>> geometry = spark.read.csv('geometry.csv', header=True)
>>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))

I would like to extract from the execution plan the fact that `area` is
calculated as the product `width * height`. One possibility would be to parse
the textual execution plan:

>>> geometry.explain(True)

...
== Optimized Logical Plan ==
Project [width#45, height#46, (cast(width#45 as double) * cast(height#46 as double)) AS area#64]
+- Relation [width#45,height#46] csv
...

From the first line of the Optimized Logical Plan we can parse the formula
"area = width * height" and then write the function back in any language.
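
For illustration, here is a very rough sketch of that parsing applied to the
Project line above (the regex and the stripping of #NN expression IDs are
assumptions about the plan's textual format, which is not a stable API):

>>> import re
>>> line = ("Project [width#45, height#46, (cast(width#45 as double) * "
...         "cast(height#46 as double)) AS area#64]")
>>> cleaned = re.sub(r'#\d+', '', line)  # drop expression IDs like width#45
>>> re.findall(r'([^,\[]+) AS (\w+)', cleaned)  # (expression, output column)
[(' (cast(width as double) * cast(height as double))', 'area')]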

However, even though I can get the logical plan as a string, there has to be
some internal representation that I could leverage to avoid the string parsing
altogether. Do you know if/how I can access that internal representation from
Python? I've been trying to navigate the Scala source code to find it, but this
is definitely beyond my area of expertise, so any pointers would be more than
welcome.
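
In case it helps frame the question: one route I've been eyeing is the py4j
handle on the underlying Dataset (this leans on the private `_jdf` attribute,
so I assume it's not the intended way):

>>> jplan = geometry._jdf.queryExecution().optimizedPlan()
>>> jplan.getClass().getName()
'org.apache.spark.sql.catalyst.plans.logical.Project'
>>> print(jplan.toJSON())  # the whole Catalyst tree serialized as JSON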

Thanks in advance,
Pablo
