[ 
https://issues.apache.org/jira/browse/SPARK-47818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47818.
----------------------------------
    Resolution: Fixed

Issue resolved by pull request 46012
[https://github.com/apache/spark/pull/46012]

> Introduce plan cache in SparkConnectPlanner to improve performance of Analyze 
> requests
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-47818
>                 URL: https://issues.apache.org/jira/browse/SPARK-47818
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 4.0.0
>            Reporter: Xi Lyu
>            Assignee: Xi Lyu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> While building a DataFrame step by step, each operation generates a new 
> DataFrame whose schema is empty and only lazily computed on access. However, 
> if a user's code frequently accesses the schema of these new DataFrames 
> using methods such as `df.columns`, it results in a large number of Analyze 
> requests to the server. Each request reanalyzes the entire plan, leading to 
> poor performance, especially when constructing highly complex plans.
> By introducing a plan cache in SparkConnectPlanner, we reduce the overhead 
> of repeated analysis during this process: significant computation is saved 
> whenever the resolved logical plan of a subtree can be cached and reused.
> A minimal example of the problem:
> {code:python}
> import pyspark.sql.functions as F
> 
> df = spark.range(10)
> for i in range(200):
>     if str(i) not in df.columns:  # <-- df.columns causes a new Analyze request in every iteration
>         df = df.withColumn(str(i), F.col("id") + i)
> df.show()
> {code}
> With this patch, the performance of the above code improved from ~110s to ~5s.
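To see why caching resolved subtrees helps here, consider a minimal standalone sketch (a hypothetical model, not the actual SparkConnectPlanner code): plans are nested tuples, and "analysis" resolves the column list recursively. Without a cache, each schema access reanalyzes the whole growing plan, so total work is quadratic in the number of iterations; memoizing per-subtree results makes it linear, since only the newly added node is analyzed each time.

```python
from functools import lru_cache

analyze_calls = 0  # counts how many plan nodes get analyzed

def analyze_uncached(plan):
    """Re-analyze the entire subtree on every call."""
    global analyze_calls
    analyze_calls += 1
    if plan[0] == "range":
        return ("id",)
    _, child, name = plan  # ("withColumn", child, name)
    return analyze_uncached(child) + (name,)

@lru_cache(maxsize=None)
def analyze_cached(plan):
    """Resolved subtrees are cached, so only new nodes are analyzed."""
    global analyze_calls
    analyze_calls += 1
    if plan[0] == "range":
        return ("id",)
    _, child, name = plan
    return analyze_cached(child) + (name,)

# Mirror the df.columns loop: ask for the schema on each iteration
# while the plan keeps growing by one node.
plan = ("range", 10)
analyze_calls = 0
for i in range(200):
    analyze_uncached(plan)               # analyzes i + 1 nodes every time
    plan = ("withColumn", plan, str(i))
uncached = analyze_calls                 # 1 + 2 + ... + 200 = 20100

plan = ("range", 10)
analyze_calls = 0
for i in range(200):
    analyze_cached(plan)                 # cache hit on the child subtree
    plan = ("withColumn", plan, str(i))
cached = analyze_calls                   # 200
```

The real implementation keys the cache on the Spark Connect relation proto rather than Python tuples, but the asymptotic effect is the same, which is consistent with the ~110s to ~5s improvement reported above.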



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
