Hello all! I'm working with PySpark, trying to reproduce through streaming
processes some of the results we get in batch, just as a PoC for now. For
this, I'm thinking of interpreting the execution plan and eventually
writing it back to Python (I'm doing something similar with pandas as well,
and I'd like both approaches to be as similar as possible).

Let me clarify with an example: suppose that, starting from a `geometry.csv`
file with `width` and `height` columns, I want to calculate the `area` like this:

>>> from pyspark.sql import functions as F
>>> geometry = spark.read.csv('geometry.csv', header=True)
>>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))

I would like to extract from the execution plan the fact that `area` is
calculated as the product of `width` and `height`. One possibility would be
to parse the execution plan:

>>> geometry.explain(True)

...
== Optimized Logical Plan ==
Project [width#45, height#46, (cast(width#45 as double) * cast(height#46 as
double)) AS area#64]
+- Relation [width#45,height#46] csv
...

From the first line of the optimized logical plan we can parse the formula
`area = width * height` and then write the function back in any language.

However, even though I can get the logical plan as a string, there must be
some internal representation that I could leverage to avoid the string
parsing. Do you know if/how I can access that internal representation from
Python? I've been trying to navigate the Scala source code to find it, but
this is definitely beyond my area of expertise, so any pointers would be
more than welcome.
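So far the only handle I've found is the private `_jdf` attribute, which
exposes the JVM-side Dataset through py4j, and from there the
`QueryExecution` object. A minimal sketch of what I have in mind (assuming
I'm reading the Catalyst sources right that `TreeNode` exposes `toJSON`;
since `_jdf` is private, this could break between Spark versions):

```python
import json


def optimized_plan(df):
    """Sketch: dump a PySpark DataFrame's optimized logical plan as parsed JSON.

    Goes through the private `_jdf` attribute (the JVM-side Dataset) via
    py4j, so this is not a supported API and may change across versions.
    """
    qe = df._jdf.queryExecution()  # org.apache.spark.sql.execution.QueryExecution
    # Catalyst's TreeNode.toJSON serialises the whole plan tree; each node
    # carries its class name and fields (e.g. the Project's projectList).
    return json.loads(qe.optimizedPlan().toJSON())


# Usage (needs a running SparkSession):
# plan = optimized_plan(geometry)
# The first node should be the Project carrying the `area` expression.
```

Is there a supported way to get at this, or is walking the py4j objects the
best I can do?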

Thanks in advance,
Pablo
