Asif created SPARK-49881:
----------------------------

             Summary: Improving analyzer performance by skipping DeduplicateRelations rule conditionally
                 Key: SPARK-49881
                 URL: https://issues.apache.org/jira/browse/SPARK-49881
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.3, 4.0.0
            Reporter: Asif

It has been observed in many cases that the DeduplicateRelations rule, though essential, by its nature significantly increases query analysis time, especially for large query plans. It appears that in many situations we can guarantee that no duplicate relations are present and thus avoid applying the rule.

Those situations are:

1) When DataFrame APIs such as select/filter, which operate on an existing DataFrame, are used.

Also, if we store the MultiInstanceRelations of a given plan in its QueryExecution, we can use that information when creating new DataFrames through operations that can introduce duplicate relations (join, union, intersection, etc.). If the two Datasets being unioned/intersected/joined have no MultiInstanceRelation in common, it should be safe to assume that duplicate relations are not possible, allowing the Dedup rule to be skipped. At least, that is the idea.
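
The proposal can be illustrated with a small self-contained Python model. This is only a sketch of the idea, not Spark code: `Plan`, `Frame`, `relation_ids`, `needs_dedup`, and the integer relation ids are hypothetical names invented for illustration. `Frame` stands in for a Dataset whose QueryExecution caches the set of MultiInstanceRelations in its plan, and the integer ids are an assumed stand-in for relation identity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node (hypothetical, not Spark's LogicalPlan)."""
    name: str
    children: Tuple["Plan", ...] = ()
    # Set only on leaves that model a MultiInstanceRelation.
    relation_id: Optional[int] = None


def collect_relation_ids(plan: Plan) -> FrozenSet[int]:
    """Walk the whole plan and gather relation ids (the costly traversal)."""
    ids = set()
    if plan.relation_id is not None:
        ids.add(plan.relation_id)
    for child in plan.children:
        ids |= collect_relation_ids(child)
    return frozenset(ids)


@dataclass(frozen=True)
class Frame:
    """Toy Dataset: a plan plus the cached relation ids for that plan."""
    plan: Plan
    relation_ids: FrozenSet[int]
    # True when the dedup rule would still have to run for this plan.
    needs_dedup: bool = False

    @staticmethod
    def from_relation(name: str, relation_id: int) -> "Frame":
        return Frame(Plan(name, relation_id=relation_id), frozenset({relation_id}))

    def unary(self, op: str) -> "Frame":
        # select/filter-style operators add no relation: the rule can be skipped.
        return Frame(Plan(op, (self.plan,)), self.relation_ids, needs_dedup=False)

    def binary(self, op: str, other: "Frame") -> "Frame":
        # join/union/intersect: skip the rule only if the cached sets are disjoint.
        overlap = not self.relation_ids.isdisjoint(other.relation_ids)
        return Frame(
            Plan(op, (self.plan, other.plan)),
            self.relation_ids | other.relation_ids,
            needs_dedup=overlap,
        )


orders = Frame.from_relation("orders", 1)
users = Frame.from_relation("users", 2)

filtered = orders.unary("filter")      # no new relation introduced
joined = filtered.binary("join", users)  # disjoint relation sets
self_join = filtered.binary("join", orders)  # relation 1 appears on both sides
```

In this model only `self_join` is flagged with `needs_dedup`, because relation 1 appears on both sides; `filtered` and `joined` would skip the rule. The decision relies on the cached `relation_ids` sets alone, so no plan traversal is needed when a new frame is created.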