Hi Pablo,

Do you mean an in-memory plan? You can access one by implementing a SparkListener. Here is an example from the DataHub project [1].
If you end up parsing the SQL plan string, you may consider using/extending Coral [2, 3]. There is already a POC for that; see some test cases [4].

Thanks,
Walaa.

[1] https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/spark-lineage/src/main/java/datahub/spark/DatahubSparkListener.java
[2] https://engineering.linkedin.com/blog/2020/coral
[3] https://github.com/linkedin/coral
[4] https://github.com/linkedin/coral/blob/master/coral-spark-plan/src/test/java/com/linkedin/coral/sparkplan/SparkPlanToIRRelConverterTest.java

On Tue, May 3, 2022 at 1:18 AM Shay Elbaz <shay.el...@gm.com> wrote:

> Hi Pablo,
>
> As you probably know, Spark SQL generates custom Java code for the SQL
> functions. You can use geometry.debugCodegen() to print out the generated
> code.
>
> Shay
>
> *From:* Pablo Alcain <pabloalc...@gmail.com>
> *Sent:* Tuesday, May 3, 2022 6:07 AM
> *To:* user@spark.apache.org
> *Subject:* [EXTERNAL] Parse Execution Plan from PySpark
>
> Hello all! I'm working with PySpark, trying to reproduce some of the
> results we see on batch through streaming processes, just as a PoC for
> now. For this, I'm thinking of trying to interpret the execution plan and
> eventually write it back to Python (I'm doing something similar with
> pandas as well, and I'd like both approaches to be as similar as
> possible).
>
> Let me clarify with an example: suppose that, starting with a
> `geometry.csv` file with `width` and `height`, I want to calculate the
> `area` like this:
>
> >>> geometry = spark.read.csv('geometry.csv', header=True)
> >>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))
>
> I would like to extract from the execution plan the fact that area is
> calculated as the product of width * height. One possibility would be to
> parse the execution plan:
>
> >>> geometry.explain(True)
>
> ...
> == Optimized Logical Plan ==
> Project [width#45, height#46, (cast(width#45 as double) * cast(height#46
> as double)) AS area#64]
> +- Relation [width#45,height#46] csv
>
> ...
>
> From the first line of the Logical Plan we can parse the formula "area =
> height * width" and then write the function back in any language.
>
> However, even though I'm getting the logical plan as a string, there has
> to be some internal representation that I could leverage and avoid the
> string parsing. Do you know if/how I can access that internal
> representation from Python? I've been trying to navigate the Scala source
> code to find it, but this is definitely beyond my area of expertise, so
> any pointers would be more than welcome.
>
> Thanks in advance,
> Pablo