Amazing, it looks like parsing the execution plan from plain text can be a
good first approach, at least for a proof of concept. I'll let you guys
know how it works out! Thanks, Walaa, for those links; they are super useful.
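
For the record, here is roughly the direction I have in mind, in case it is
useful to anyone else. It leans on the private `_jdf` handle (the py4j bridge
to the JVM Dataset), so it is only a sketch and may break between Spark
versions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
geometry = spark.read.csv('geometry.csv', header=True)
geometry = geometry.withColumn('area', F.col('width') * F.col('height'))

# The QueryExecution object lives on the JVM side; _jdf exposes it via py4j
qe = geometry._jdf.queryExecution()

# The "== Optimized Logical Plan ==" section of explain(True), as a string
plan_text = qe.optimizedPlan().toString()

# TreeNode.toJSON gives a machine-readable version of the same tree
# (worth double-checking that it is available on your Spark version)
plan_json = qe.optimizedPlan().toJSON()

print(plan_text)
print(plan_json)

If toJSON() behaves the way it looks in the Scala sources, the JSON form
already carries each Project expression as a structured node, which should
avoid most of the string parsing.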

On Tue, May 3, 2022 at 5:39 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Hi Pablo,
>
> Do you mean an in-memory plan? You can access one by implementing a Spark
> Listener. Here is an example from the Datahub project [1].
>
> If you end up parsing the SQL plan string, you may consider
> using/extending Coral [2, 3]. There is already a POC for that. See some
> test cases [4].
>
> Thanks,
> Walaa.
>
> [1]
> https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/spark-lineage/src/main/java/datahub/spark/DatahubSparkListener.java
> [2] https://engineering.linkedin.com/blog/2020/coral
> [3] https://github.com/linkedin/coral
> [4]
> https://github.com/linkedin/coral/blob/master/coral-spark-plan/src/test/java/com/linkedin/coral/sparkplan/SparkPlanToIRRelConverterTest.java
>
>
> On Tue, May 3, 2022 at 1:18 AM Shay Elbaz <shay.el...@gm.com> wrote:
>
>> Hi Pablo,
>>
>> As you probably know, Spark SQL generates custom Java code for the SQL
>> functions. You can use geometry.debugCodegen() to print out the generated
>> code.
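>>
>> (From PySpark, I believe the closest equivalent on Spark 3.0+ is the
>> "codegen" explain mode; the Scala debugCodegen() helper also needs
>> import org.apache.spark.sql.execution.debug._ first.)
>>
>> >>> geometry.explain(mode="codegen")  # prints the plan and the generated Java code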
>>
>> Shay
>>
>> *From:* Pablo Alcain <pabloalc...@gmail.com>
>> *Sent:* Tuesday, May 3, 2022 6:07 AM
>> *To:* user@spark.apache.org
>> *Subject:* [EXTERNAL] Parse Execution Plan from PySpark
>>
>> Hello all! I'm working with PySpark, trying to reproduce through streaming
>> processes some of the results we get in batch, just as a PoC for now. For
>> this, I'm thinking of interpreting the execution plan and eventually writing
>> it back to Python (I'm doing something similar with pandas as well, and I'd
>> like both approaches to be as similar as possible).
>>
>> Let me clarify with an example: suppose that, starting from a
>> `geometry.csv` file with `width` and `height` columns, I want to calculate
>> the `area` like this:
>>
>> >>> from pyspark.sql import functions as F
>>
>> >>> geometry = spark.read.csv('geometry.csv', header=True)
>>
>> >>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))
>>
>> I would like to extract from the execution plan the fact that `area` is
>> calculated as width * height. One possibility would be to parse the
>> execution plan string:
>>
>> >>> geometry.explain(True)
>>
>> ...
>>
>> == Optimized Logical Plan ==
>>
>> Project [width#45, height#46, (cast(width#45 as double) * cast(height#46 as double)) AS area#64]
>> +- Relation [width#45,height#46] csv
>>
>> ...
>>
>> From the Project line of the optimized logical plan we can parse the formula
>> "area = width * height" and then write the function back in any language.
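>>
>> A minimal sketch of that string-parsing idea (the regexes below only cover
>> this particular shape of Project line, so it's just an illustration):
>>
>> import re
>>
>> plan_line = ("Project [width#45, height#46, (cast(width#45 as double) * "
>>              "cast(height#46 as double)) AS area#64]")
>>
>> cleaned = re.sub(r'#\d+', '', plan_line)                   # drop expression ids
>> cleaned = re.sub(r'cast\((\w+) as \w+\)', r'\1', cleaned)  # drop the casts
>> match = re.search(r'\(([^)]+)\) AS (\w+)', cleaned)
>> if match:
>>     expression, column = match.groups()
>>     print(f'{column} = {expression}')  # area = width * height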
>>
>> However, even though I'm getting the logical plan as a string, there has
>> to be some internal representation that I could leverage and avoid
>> the string parsing. Do you know if/how I can access that internal
>> representation from Python? I've been trying to navigate the Scala source
>> code to find it, but this is definitely beyond my area of expertise, so any
>> pointers would be more than welcome.
>>
>> Thanks in advance,
>>
>> Pablo
>>
>
