[spark] branch master updated: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val

wenchen Mon, 29 Nov 2021 00:47:42 -0800

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 5d09828  [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
5d09828 is described below

commit 5d098286a003eac26d1fe6c026d8123f3cbe9e9c
Author: Josh Rosen <joshro...@databricks.com>
AuthorDate: Mon Nov 29 16:46:28 2021 +0800

    [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
    
    ### What changes were proposed in this pull request?
    
    This PR adds caching to `LogicalPlan.isStreaming()`: the default implementation's result will now be cached in a `private lazy val`.
    
    ### Why are the changes needed?
    
    This improves the performance of the `DeduplicateRelations` analyzer rule.
    
    The default implementation of `isStreaming` recursively visits every node in the tree. `DeduplicateRelations.renewDuplicatedRelations` is recursively invoked on every node in the tree and each invocation calls `isStreaming`. This leads to `O(n^2)` invocations of `isStreaming` on leaf nodes.
    
    Caching `isStreaming` avoids this performance problem.
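    
    The effect can be illustrated with a toy model. This is a minimal sketch, not Spark code: `IsStreamingSketch`, `Node`, `Leaf`, `Uncached`, `Cached`, `leftDeep`, `visitAll`, and `countLeafCalls` are hypothetical names invented for illustration, and `visitAll` only stands in for a rule that calls `isStreaming` at every node while recursing.
    
    ```scala
    // Toy model only: none of these names exist in Spark.
    object IsStreamingSketch {
      // Number of times any leaf's isStreaming has been evaluated.
      var leafCalls = 0L
    
      sealed trait Node {
        def children: Seq[Node]
        def isStreaming: Boolean
      }
    
      // Stand-in for a leaf relation; every call is counted.
      final class Leaf(streaming: Boolean) extends Node {
        val children: Seq[Node] = Nil
        def isStreaming: Boolean = { leafCalls += 1; streaming }
      }
    
      // Shape of the old default: recompute from the children on every call.
      final class Uncached(val children: Seq[Node]) extends Node {
        def isStreaming: Boolean = children.exists(_.isStreaming)
      }
    
      // Shape of the new default: compute once per node, then reuse.
      final class Cached(val children: Seq[Node]) extends Node {
        def isStreaming: Boolean = _isStreaming
        private lazy val _isStreaming = children.exists(_.isStreaming)
      }
    
      // Builds a left-deep tree with n inner nodes and n + 1 non-streaming leaves.
      def leftDeep(n: Int, mk: Seq[Node] => Node): Node =
        (1 to n).foldLeft[Node](new Leaf(false)) { (sub, _) =>
          mk(Seq(sub, new Leaf(false)))
        }
    
      // Calls isStreaming at every node while walking the tree top-down.
      def visitAll(node: Node): Unit = {
        node.isStreaming
        node.children.foreach(visitAll)
      }
    
      // Resets the counter, walks the tree, and reports the leaf evaluations.
      def countLeafCalls(root: Node): Long = {
        leafCalls = 0L
        visitAll(root)
        leafCalls
      }
    }
    ```
    
    In this model, `countLeafCalls(leftDeep(n, new Uncached(_)))` grows quadratically in `n`, while the `Cached` variant grows linearly, because each inner node evaluates its children at most once.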
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Correctness should be covered by existing tests.
    
    This significantly improved `DeduplicateRelations` performance in local microbenchmarking with large query plans (~20% reduction in that rule's runtime in one of my tests).
    
    Closes #34691 from JoshRosen/cache-LogicalPlan.isStreaming.
    
    Authored-by: Josh Rosen <joshro...@databricks.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala      | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index 7c31a00..4aa7bf1 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -41,7 +41,8 @@ abstract class LogicalPlan
   def metadataOutput: Seq[Attribute] = children.flatMap(_.metadataOutput)
 
   /** Returns true if this subtree has data from a streaming data source. */
-  def isStreaming: Boolean = children.exists(_.isStreaming)
+  def isStreaming: Boolean = _isStreaming
+  private[this] lazy val _isStreaming = children.exists(_.isStreaming)
 
   override def verboseStringWithSuffix(maxFields: Int): String = {
     super.verboseString(maxFields) + statsCache.map(", " + _.toString).getOrElse("")

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
