[ https://issues.apache.org/jira/browse/SPARK-49881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Asif updated SPARK-49881:
-------------------------
    Description:
In many cases, it has been observed that the DeduplicateRelations rule, though essential, significantly increases query analysis time, especially for large query plans.
It appears that in many situations we can guarantee that no duplicate relations are present, and can therefore avoid applying the rule.
Those situations are:
1) When DataFrame APIs such as select/filter, which operate on an existing DataFrame, are used.

Also, if we store the MultiInstanceRelations of a given plan in its QueryExecution, we can use that information when creating new DataFrames where duplicate relations are possible (join, union, intersection, etc.).
If the two datasets being unioned/intersected/joined have no MultiInstanceRelation in common, it should be safe to assume that duplicate relations cannot occur, which allows the Dedup rule to be skipped.

At least, that is the idea.

  was:
In many cases, it has been observed that DeduplicateRelations rule, though essential, by its nature has impacted query analysis time to a big extent, especially when dealing with large query plans.
It appears that in many situations we can guarantee that there would be no duplicate relations present and thus avoid applying the rule.
Those situations are:
1) When dataframe APIs like select/filter, which operate on an existing dataframe, are used.

Also, if we store the MultiInstanceRelations in the QueryExecution for a given plan, then we can make use of that information while creating new dataframes where there is a possibility of duplicate relations (like join, union, intersection etc.).
If two datasets being unioned/intersected/joined etc. have no common MultiInstanceRelation, then it should be safe to assume that there is no possibility of duplicate relations, thereby allowing the Dedup rule to skip.

At least that is the idea.
> SPIP: Improving analyzer performance by skipping DeduplicateRelations rule conditionally
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-49881
>                 URL: https://issues.apache.org/jira/browse/SPARK-49881
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0, 3.5.3
>            Reporter: Asif
>            Priority: Major
>
> In many cases, it has been observed that the DeduplicateRelations rule, though
> essential, significantly increases query analysis time, especially for large
> query plans.
> It appears that in many situations we can guarantee that no duplicate
> relations are present, and can therefore avoid applying the rule.
> Those situations are:
> 1) When DataFrame APIs such as select/filter, which operate on an existing
> DataFrame, are used.
>
> Also, if we store the MultiInstanceRelations of a given plan in its
> QueryExecution, we can use that information when creating new DataFrames
> where duplicate relations are possible (join, union, intersection, etc.).
> If the two datasets being unioned/intersected/joined have no
> MultiInstanceRelation in common, it should be safe to assume that duplicate
> relations cannot occur, which allows the Dedup rule to be skipped.
>
> At least, that is the idea.
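
The skip condition described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark code: Dataset, read, unary and binary are hypothetical names, a string such as "t1#1" stands in for one MultiInstanceRelation instance, and the relations field stands in for the set that the proposal would store in QueryExecution.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Dataset:
    """Toy stand-in for a DataFrame plus the relation set cached for its plan."""
    description: str
    relations: FrozenSet[str]  # hypothetical ids of the relation instances in the plan
    dedup_skipped: bool        # True if the dedup rule could be skipped for this step


def read(relation: str) -> Dataset:
    # A plan with a single leaf relation cannot contain a duplicate of it.
    return Dataset(relation, frozenset({relation}), True)


def unary(op: str, child: Dataset) -> Dataset:
    # select/filter on an existing dataset add no relations (the issue's case 1),
    # so the child's cached set is reused and the rule is skipped.
    return Dataset(f"{op}({child.description})", child.relations, True)


def binary(op: str, left: Dataset, right: Dataset) -> Dataset:
    # join/union/intersect: skip only when the two sides share no relation instance.
    can_skip = left.relations.isdisjoint(right.relations)
    return Dataset(
        f"{op}({left.description}, {right.description})",
        left.relations | right.relations,
        can_skip,
    )


if __name__ == "__main__":
    t1, t2 = read("t1#1"), read("t2#2")
    filtered = unary("filter", t1)
    print(binary("join", filtered, t2).dedup_skipped)  # disjoint sides
    print(binary("join", filtered, t1).dedup_skipped)  # self-join shares t1#1
```

In this model the cost of the check is a set intersection on the cached relation sets rather than a traversal of both plans, which is where the proposal expects to save analysis time on large plans.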