[ https://issues.apache.org/jira/browse/SPARK-47818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon reassigned SPARK-47818:
------------------------------------

    Assignee: Xi Lyu

> Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-47818
>                 URL: https://issues.apache.org/jira/browse/SPARK-47818
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect
>    Affects Versions: 4.0.0
>            Reporter: Xi Lyu
>            Assignee: Xi Lyu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
> While building a DataFrame step by step, each new DataFrame is generated with an empty schema, which is lazily computed on access. However, if user code frequently accesses the schema of these new DataFrames through methods such as `df.columns`, it results in a large number of Analyze requests to the server. Each request reanalyzes the entire plan, leading to poor performance, especially when constructing highly complex plans.
>
> By introducing a plan cache in SparkConnectPlanner, we aim to reduce the overhead of repeated analysis during this process. Significant computation is saved whenever the resolved logical plan of a subtree can be cached.
>
> A minimal example of the problem:
> {code:python}
> import pyspark.sql.functions as F
>
> df = spark.range(10)
> for i in range(200):
>     if str(i) not in df.columns:  # <-- The df.columns call causes a new Analyze request in every iteration
>         df = df.withColumn(str(i), F.col("id") + i)
> df.show()
> {code}
>
> With this patch, the performance of the above code improved from ~110s to ~5s.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
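The caching idea described in the issue can be sketched as a memo table keyed by the (unchanged) unresolved plan: repeated Analyze requests for the same plan subtree return the cached resolved result instead of reanalyzing. This is an illustrative standalone sketch only, not the actual SparkConnectPlanner implementation; the `PlanCache` class, its method names, and the string stand-in for analysis are all hypothetical.

```python
from collections import OrderedDict


class PlanCache:
    """Minimal LRU cache mapping a hashable plan key to its analyzed result.

    Illustrative only: the real SparkConnectPlanner caches resolved logical
    plan subtrees per session; all names here are hypothetical.
    """

    def __init__(self, capacity=16):
        self.capacity = capacity
        self._cache = OrderedDict()
        self.analyze_calls = 0  # counts expensive analyses actually performed

    def _do_analyze(self, plan):
        # Stand-in for the expensive server-side analysis of a plan.
        self.analyze_calls += 1
        return f"resolved({plan})"

    def analyze(self, plan):
        if plan in self._cache:
            self._cache.move_to_end(plan)  # mark as most recently used
            return self._cache[plan]
        result = self._do_analyze(plan)
        self._cache[plan] = result
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used entry
        return result


cache = PlanCache()
# Mimics the issue's loop: 200 schema lookups against the same plan subtree
# trigger only one real analysis; the rest are cache hits.
for _ in range(200):
    cache.analyze("range(10)")
print(cache.analyze_calls)  # -> 1
```

Without the cache, each of the 200 lookups would pay the full analysis cost, which is the behavior behind the ~110s baseline quoted in the issue.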