Github user nsyca commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15763#discussion_r86653844
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -1044,6 +1044,34 @@ class Analyzer(
               failOnOuterReference(p)
               p
           }
    +
    +      // SPARK-17348
    +      // Looking for a potential incorrect result case.
    +      // When a correlated predicate is a non-equality predicate
    +      // it must be placed at the immediate child operator.
    +      // Otherwise, the pull up of the correlated predicate
    +      // will generate a plan with a different semantics
    +      // which could return incorrect result.
    +      var continue : Boolean = true
    --- End diff --
    
    One technique I know of for transforming correlated queries into queries 
with no correlation is outlined in this 1996 IEEE Data Engineering paper.
    
        Complex query decorrelation
        P. Seshadri; H. Pirahesh; T. Y. C. Leung
        Data Engineering, 1996. Proceedings of the Twelfth International Conference on
        Pages: 450 - 458
    
    Distributed systems aggravate the performance impact of correlated queries, 
because the entire data set of the subquery has to be moved to where the data 
of the outer tables resides. This processing is similar to 
`BroadcastNestedLoopJoinExec` in Spark.
    
    The idea behind the paper is to build a duplicate of the relevant portion 
of the outer tables and de-correlate the original subquery by joining that 
duplicate inside the subquery. The algorithm is claimed to be generic and 
applicable to all forms of correlation: shallow correlation, where the 
correlated point is immediately below the operation over the outer table(s), 
and deep correlation, where the correlated point is at an arbitrary level below 
the operation over the outer table(s).
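
    To make the idea concrete, here is a minimal sketch using plain Scala 
collections. It is neither Spark's Analyzer code nor the paper's exact 
algorithm. The `Outer` and `Inner` tables, their columns, the equality 
correlation on `k`, and the SUM aggregate are all assumptions made up for 
illustration. The "duplicate portion" is modelled here as the distinct set of 
correlation values taken from the outer table.

```scala
object DecorrelationSketch {
  // Hypothetical toy tables; names and columns are illustrative only.
  final case class Outer(id: Int, k: Int)
  final case class Inner(k: Int, v: Int)

  // Correlated form: the subquery is re-evaluated once per outer row, roughly
  //   SELECT id, (SELECT SUM(v) FROM inner WHERE inner.k = outer.k) FROM outer
  // None stands in for SQL NULL when no inner row matches.
  def correlated(outer: Seq[Outer], inner: Seq[Inner]): Seq[(Int, Option[Int])] =
    outer.map { o =>
      val vs = inner.filter(_.k == o.k).map(_.v)
      (o.id, if (vs.isEmpty) None else Some(vs.sum))
    }

  // Decorrelated form: the subquery no longer refers to the outer row.
  def decorrelated(outer: Seq[Outer], inner: Seq[Inner]): Seq[(Int, Option[Int])] = {
    // Step 1: copy the correlation values out of the outer table (distinct).
    val keys: Set[Int] = outer.map(_.k).toSet

    // Step 2: join that copy inside the subquery and aggregate once per key.
    val sums: Map[Int, Int] =
      inner
        .filter(i => keys.contains(i.k))
        .groupBy(_.k)
        .map { case (k, rows) => k -> rows.map(_.v).sum }

    // Step 3: join the subquery result back to the outer table.
    // A key with no inner rows yields None, matching the correlated form.
    outer.map(o => (o.id, sums.get(o.k)))
  }

  def main(args: Array[String]): Unit = {
    val outer = Seq(Outer(1, 10), Outer(2, 20), Outer(3, 10), Outer(4, 30))
    val inner = Seq(Inner(10, 5), Inner(10, 7), Inner(20, 1), Inner(99, 100))
    println(correlated(outer, inner))
    println(decorrelated(outer, inner))
  }
}
```

    In this sketch the inner table is scanned once rather than once per outer 
row, and only the distinct correlation values travel into the subquery. It 
covers only an equality correlation at the immediate child. The non-equality 
and deep-correlation cases discussed in this PR need more care.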

