Re: [EXTERNAL] Parse Execution Plan from PySpark

2022-05-05 Thread Pablo Alcain
Amazing, it looks like parsing the execution plan from plain text can be a
good first approach, at least for a proof of concept. I'll let you guys
know how it works out! Thanks, Walaa, for the links; they are super useful.



Re: [EXTERNAL] Parse Execution Plan from PySpark

2022-05-03 Thread Walaa Eldin Moustafa
Hi Pablo,

Do you mean an in-memory plan? You can access one by implementing a Spark
Listener. Here is an example from the Datahub project [1].

If you end up parsing the SQL plan string, you may consider using/extending
Coral [2, 3]. There is already a POC for that. See some test cases [4].

Thanks,
Walaa.

[1]
https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/spark-lineage/src/main/java/datahub/spark/DatahubSparkListener.java
[2] https://engineering.linkedin.com/blog/2020/coral
[3] https://github.com/linkedin/coral
[4]
https://github.com/linkedin/coral/blob/master/coral-spark-plan/src/test/java/com/linkedin/coral/sparkplan/SparkPlanToIRRelConverterTest.java
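
As a concrete illustration of the string-parsing route: the `Project` line of the optimized plan has a regular enough shape that a toy extractor can pull out derived-column formulas. A minimal Python sketch, purely illustrative; it ignores nested plans, quoted strings, and multi-line output, which is where something like Coral earns its keep:

```python
import re

def _split_top_level(s):
    """Split a Project output list on commas that sit outside parentheses."""
    parts, depth, cur = [], 0, []
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        if ch == ',' and depth == 0:
            parts.append(''.join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    if cur:
        parts.append(''.join(cur).strip())
    return parts

def parse_project_line(line):
    """Map derived-column names to their formulas in a 'Project [...]' line,
    stripping the '#NN' expression IDs Catalyst appends to attribute names."""
    m = re.match(r"Project \[(.*)\]$", line.strip())
    if not m:
        return {}
    formulas = {}
    for item in _split_top_level(m.group(1)):
        expr, sep, alias = item.rpartition(' AS ')
        if sep:  # plain pass-through columns carry no alias, so they are skipped
            formulas[re.sub(r'#\d+', '', alias)] = re.sub(r'#\d+', '', expr)
    return formulas
```

On the Project line from the geometry example this yields `{'area': '(cast(width as double) * cast(height as double))'}`.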




RE: [EXTERNAL] Parse Execution Plan from PySpark

2022-05-03 Thread Shay Elbaz
Hi Pablo,

As you probably know, Spark SQL generates custom Java code for the SQL 
functions. You can use geometry.debugCodegen() (after `import 
org.apache.spark.sql.execution.debug._` in Scala) to print out the generated code.

Shay
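
On the PySpark side: `debugCodegen()` is a Scala implicit (from `org.apache.spark.sql.execution.debug`), but since Spark 3.0 the same generated code can be printed from Python via `DataFrame.explain`'s `mode` argument. A small sketch, with `print_codegen` as a hypothetical helper name:

```python
def print_codegen(df):
    """Print the whole-stage-codegen Java source for a DataFrame's plan.

    Assumes Spark 3.0+, where DataFrame.explain accepts mode= one of
    "simple", "extended", "codegen", "cost", or "formatted"."""
    df.explain(mode="codegen")
```

For example, `print_codegen(geometry)` on the geometry DataFrame from the example below.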

From: Pablo Alcain 
Sent: Tuesday, May 3, 2022 6:07 AM
To: user@spark.apache.org
Subject: [EXTERNAL] Parse Execution Plan from PySpark

ATTENTION: This email originated from outside of GM.



Hello all! I'm working with PySpark, trying to reproduce through streaming 
processes some of the results we see in batch, just as a PoC for now. For this, 
I'm thinking of interpreting the execution plan and eventually writing it back 
to Python (I'm doing something similar with pandas as well, and I'd like both 
approaches to be as similar as possible).

Let me clarify with an example: suppose that starting with a `geometry.csv` 
file with `width` and `height` I want to calculate the `area` doing this:

>>> from pyspark.sql import functions as F
>>> geometry = spark.read.csv('geometry.csv', header=True)
>>> geometry = geometry.withColumn('area', F.col('width') * F.col('height'))

I would like to extract from the execution plan the fact that area is 
calculated as the product of width * height. One possibility would be to parse 
the execution plan:

>>> geometry.explain(True)

...
== Optimized Logical Plan ==
Project [width#45, height#46, (cast(width#45 as double) * cast(height#46 as 
double)) AS area#64]
+- Relation [width#45,height#46] csv
...

From the first line of the Logical Plan we can parse the formula "area = width 
* height" and then write the function back in any language.

However, even though I'm getting the logical plan as a string, there has to be 
some internal representation that I could leverage and avoid the string 
parsing. Do you know if/how I can access that internal representation from 
Python? I've been trying to navigate the Scala source code to find it, but this 
is definitely beyond my area of expertise, so any pointers would be more than 
welcome.

Thanks in advance,
Pablo
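
Closing the loop on the question above: the internal representation is reachable from Python through the py4j handle that backs every DataFrame. A minimal sketch; note that `_jdf` and `queryExecution()` are internal, version-dependent surfaces (Spark 3.x here), not a stable public Python API:

```python
def optimized_plan_string(df):
    """Return the optimized logical plan of `df` as a string.

    df._jdf is the py4j proxy for the underlying Scala Dataset; its
    queryExecution() exposes the Catalyst plans (analyzed, optimized,
    executed). The plan object itself is a JVM LogicalPlan, so richer
    traversal than toString() is possible through the same proxy, at
    the cost of crossing the Python/JVM boundary on every call."""
    return df._jdf.queryExecution().optimizedPlan().toString()
```

`optimized_plan_string(geometry)` should give the same `Project ... +- Relation` text that `explain` prints, but as a Python string that can be handed to a parser instead of scraped from stdout.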