Asif created SPARK-49881:
----------------------------

             Summary: Improving analyzer performance by skipping 
DeduplicateRelations rule conditionally
                 Key: SPARK-49881
                 URL: https://issues.apache.org/jira/browse/SPARK-49881
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.3, 4.0.0
            Reporter: Asif


In many cases, it has been observed that the DeduplicateRelations rule, though 
essential, by its nature adds significantly to query analysis time, 
especially when dealing with large query plans.

It appears that in many situations we can guarantee that no duplicate 
relations are present, and thus avoid applying the rule.

These situations include:

1) When DataFrame APIs such as select/filter, which operate on an existing 
DataFrame, are used.
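
The reasoning behind case 1 can be illustrated with a toy model. This is only 
a sketch, not Spark code: Relation, Filter, Project and the helper functions 
are hypothetical stand-ins, and it assumes the select/filter expressions 
contain no subqueries. A single-child operator reuses the relation instances 
of its child, so it cannot introduce a repeated instance on its own.

```python
from dataclasses import dataclass
from typing import List, Union


# Toy stand-ins for logical plan nodes; these are NOT Spark classes.
@dataclass(frozen=True)
class Relation:
    instance_id: int  # identifies one relation instance in the plan


@dataclass(frozen=True)
class Filter:
    condition: str
    child: "Plan"


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: "Plan"


Plan = Union[Relation, Filter, Project]


def relation_instances(plan: Plan) -> List[int]:
    """Collect the ids of all leaf relations, keeping repeats."""
    if isinstance(plan, Relation):
        return [plan.instance_id]
    # Filter and Project have exactly one child, so they add no new leaves.
    return relation_instances(plan.child)


def has_duplicate_relations(plan: Plan) -> bool:
    """True if the same relation instance appears more than once."""
    ids = relation_instances(plan)
    return len(ids) != len(set(ids))


base = Relation(instance_id=1)
derived = Project(("a",), Filter("a > 0", base))
```

In this model, derived has exactly the same leaf relations as base, so a 
duplicate check on it can never find anything that was not already in base.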

 

Also, if we store the MultiInstanceRelations of a given plan in its 
QueryExecution, we can make use of that information when creating new 
DataFrames through operations that could introduce duplicate relations (such 
as join, union, intersection, etc.).

If the two datasets being unioned, intersected, joined, etc. have no 
MultiInstanceRelation in common, then it should be safe to assume that there 
is no possibility of duplicate relations, thereby allowing the 
DeduplicateRelations rule to be skipped.
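
The disjointness check could be sketched roughly as follows. This is a toy 
model under stated assumptions, not the proposed patch: TrackedPlan is a 
hypothetical stand-in for a QueryExecution that remembers the ids of its 
MultiInstanceRelations, and scan, unary, can_skip_dedup and combine are 
illustrative names only.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class TrackedPlan:
    """Hypothetical stand-in for a QueryExecution that caches its relations."""
    description: str
    relations: FrozenSet[int]  # ids of the multi-instance relations in the plan


def scan(name: str, relation_id: int) -> TrackedPlan:
    """A leaf plan that reads exactly one relation instance."""
    return TrackedPlan(name, frozenset({relation_id}))


def unary(op: str, child: TrackedPlan) -> TrackedPlan:
    """select/filter-style operator: the relation set is carried over unchanged."""
    return TrackedPlan(f"{op}({child.description})", child.relations)


def can_skip_dedup(left: TrackedPlan, right: TrackedPlan) -> bool:
    """Deduplication is unnecessary when the two sides share no relation."""
    return left.relations.isdisjoint(right.relations)


def combine(op: str, left: TrackedPlan, right: TrackedPlan) -> TrackedPlan:
    """join/union/intersect-style operator: the relation sets are merged."""
    return TrackedPlan(
        f"{op}({left.description}, {right.description})",
        left.relations | right.relations,
    )


orders = scan("orders", 1)
customers = scan("customers", 2)

# Different relations on each side: the rule could be skipped.
skip_for_join = can_skip_dedup(orders, customers)

# Self-join of a DataFrame with a filtered copy of itself: the rule must run.
skip_for_self_join = can_skip_dedup(orders, unary("filter", orders))

joined = combine("join", orders, customers)
```

The check is a set intersection over cached ids, so in this model its cost 
depends on the number of relations rather than on the size of the plan tree.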

 

At least, that is the idea.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
