Hi Pablo,

Do you mean an in-memory plan? You can access one by implementing a SparkListener. Here is an example from the DataHub project [1].
If you end up parsing the SQL plan string, you may consider using/extending Coral [2, 3]. There is already a POC for that; see some test cases [4].

Thanks,
Walaa.

[1] https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/spark-lineage/src/main/java/datahub/spark/DatahubSparkListener.java
[2] https://engineering.linkedin.com/blog/2020/coral
[3] https://github.com/linkedin/coral
[4] https://github.com/linkedin/coral/blob/master/coral-spark-plan/src/test/java/com/linkedin/coral/sparkplan/SparkPlanToIRRelConverterTest.java

On Tue, May 3, 2022 at 1:18 AM Shay Elbaz <shay.el...@gm.com> wrote:

> Hi Pablo,
>
> As you probably know, Spark SQL generates custom Java code for the SQL
> functions. You can use geometry.debugCodegen() to print out the generated
> code.
>
> Shay
>
> *From:* Pablo Alcain <pabloalc...@gmail.com>
> *Sent:* Tuesday, May 3, 2022 6:07 AM
> *To:* user@spark.apache.org
> *Subject:* [EXTERNAL] Parse Execution Plan from PySpark
>
> Hello all! I'm working with PySpark, trying to reproduce some of the
> results we see on batch through streaming processes, just as a PoC for
> now. For this, I'm thinking of trying to interpret the execution plan and
> eventually write it back to Python (I'm doing something similar with
> pandas as well, and I'd like both approaches to be as similar as
> possible).
>
> Let me clarify with an example: suppose that, starting with a
> `geometry.csv` file with `width` and `height`, I want to calculate the
> `area` like this:
>
> >>> geometry = spark.read.csv('geometry.csv', header=True)
> >>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))
>
> I would like to extract from the execution plan the fact that area is
> calculated as the product of width * height. One possibility would be to
> parse the execution plan:
>
> >>> geometry.explain(True)
>
> ...
> == Optimized Logical Plan ==
> Project [width#45, height#46, (cast(width#45 as double) * cast(height#46
> as double)) AS area#64]
> +- Relation [width#45,height#46] csv
>
> ...
>
> From the first line of the Logical Plan we can parse the formula "area =
> height * width" and then write the function back in any language.
>
> However, even though I'm getting the logical plan as a string, there has
> to be some internal representation that I could leverage and avoid the
> string parsing. Do you know if/how I can access that internal
> representation from Python? I've been trying to navigate the Scala source
> code to find it, but this is definitely beyond my area of expertise, so
> any pointers would be more than welcome.
>
> Thanks in advance,
> Pablo