Amazing, it looks like parsing the execution plan from plain text can be a good first approach, at least for a proof of concept. I'll let you guys know how it works out! Thanks, Walaa, for those links; they are super useful.
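For the archives, this is roughly the sketch I have in mind for the string-parsing approach. It's a minimal, hypothetical example: the regex only covers the simple "(expression) AS alias" shape from the Project line in my example quoted below, nothing more general:

>>> import re
>>> line = ("Project [width#45, height#46, (cast(width#45 as double)"
...         " * cast(height#46 as double)) AS area#64]")
>>> match = re.search(r"\((.+)\) AS (\w+)#\d+", line)  # "(expression) AS alias#id"
>>> column = match.group(2)
>>> # drop the casts and the #id suffixes to recover the bare formula
>>> formula = re.sub(r"cast\((\w+)#\d+ as \w+\)", r"\1", match.group(1))
>>> print(column, "=", formula)
area = width * height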
On Tue, May 3, 2022 at 5:39 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Hi Pablo,
>
> Do you mean an in-memory plan? You can access one by implementing a
> Spark Listener. Here is an example from the Datahub project [1].
>
> If you end up parsing the SQL plan string, you may consider
> using/extending Coral [2, 3]. There is already a POC for that. See some
> test cases [4].
>
> Thanks,
> Walaa.
>
> [1] https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/spark-lineage/src/main/java/datahub/spark/DatahubSparkListener.java
> [2] https://engineering.linkedin.com/blog/2020/coral
> [3] https://github.com/linkedin/coral
> [4] https://github.com/linkedin/coral/blob/master/coral-spark-plan/src/test/java/com/linkedin/coral/sparkplan/SparkPlanToIRRelConverterTest.java
>
> On Tue, May 3, 2022 at 1:18 AM Shay Elbaz <shay.el...@gm.com> wrote:
>
>> Hi Pablo,
>>
>> As you probably know, Spark SQL generates custom Java code for the SQL
>> functions. You can use geometry.debugCodegen() to print out the
>> generated code.
>>
>> Shay
>>
>> *From:* Pablo Alcain <pabloalc...@gmail.com>
>> *Sent:* Tuesday, May 3, 2022 6:07 AM
>> *To:* user@spark.apache.org
>> *Subject:* [EXTERNAL] Parse Execution Plan from PySpark
>>
>> Hello all! I'm working with PySpark, trying to reproduce through
>> streaming processes some of the results we see in batch, just as a PoC
>> for now. For this, I'm thinking of interpreting the execution plan and
>> eventually writing it back to Python (I'm doing something similar with
>> pandas as well, and I'd like both approaches to be as similar as
>> possible).
>>
>> Let me clarify with an example: suppose that, starting from a
>> `geometry.csv` file with `width` and `height`, I want to calculate the
>> `area` like this:
>>
>> >>> geometry = spark.read.csv('geometry.csv', header=True)
>> >>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))
>>
>> I would like to extract from the execution plan the fact that area is
>> calculated as the product width * height. One possibility would be to
>> parse the execution plan:
>>
>> >>> geometry.explain(True)
>> ...
>> == Optimized Logical Plan ==
>> Project [width#45, height#46, (cast(width#45 as double) * cast(height#46 as double)) AS area#64]
>> +- Relation [width#45,height#46] csv
>> ...
>>
>> From the first line of the Optimized Logical Plan we can parse the
>> formula "area = width * height" and then write the function back in any
>> language.
>>
>> However, even though I'm getting the logical plan as a string, there has
>> to be some internal representation that I could leverage and avoid the
>> string parsing. Do you know if/how I can access that internal
>> representation from Python? I've been trying to navigate the Scala
>> source code to find it, but this is definitely beyond my area of
>> expertise, so any pointers would be more than welcome.
>>
>> Thanks in advance,
>> Pablo
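P.S. Partially answering my own question quoted above: the in-memory plan seems to be reachable from Python through the py4j gateway. A minimal sketch, with the caveat that _jdf and everything behind it are private Spark internals rather than a stable Python API:

>>> qe = geometry._jdf.queryExecution()       # py4j proxy for the JVM QueryExecution
>>> print(qe.optimizedPlan().toString())      # same text that explain(True) prints
>>> exprs = qe.optimizedPlan().expressions()  # Scala Seq[Expression], also a py4j proxy
>>> print(exprs.apply(2).sql())               # index 2: 'area' is the third item in the Project list

Every call here crosses the Python/JVM bridge and hands back Catalyst objects, so this avoids parsing the whole plan string, at the cost of walking the tree through py4j proxies.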