[jira] [Updated] (SPARK-49881) SPIP : Improving analyzer performance by skipping DeduplicateRelations rule conditionally

Asif (Jira) Fri, 04 Oct 2024 11:49:04 -0700


     [ https://issues.apache.org/jira/browse/SPARK-49881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Asif updated SPARK-49881:
-------------------------
    Description: 
The DeduplicateRelations rule, though essential, has been observed in many 
cases to add significantly to query analysis time, especially for large 
query plans.

In many situations we can guarantee that no duplicate relations are present, 
and can therefore avoid applying the rule.

Those situations are:

1) When DataFrame APIs such as select/filter, which operate on an existing 
DataFrame, are used.

Also, if we store the MultiInstanceRelations for a given plan in its 
QueryExecution, we can use that information when creating new DataFrames 
through operations that can introduce duplicate relations (such as join, 
union, intersection, etc.).

If two datasets being unioned/intersected/joined, etc. have no common 
MultiInstanceRelation, then it should be safe to assume that duplicate 
relations are not possible, which allows the Dedup rule to be skipped.

At least that is the idea.
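
The check described above can be sketched with a toy model. This is only an 
illustration in plain Python, not Spark code: CachedPlan, scan, unary and 
binary are hypothetical names, and relation ids stand in for 
MultiInstanceRelation instances. It assumes each plan carries the set of 
relation ids it was built from, so a binary operator only needs a set 
disjointness test to decide whether deduplication could be required.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class CachedPlan:
    """Toy stand-in for a plan plus the relation set cached alongside it.

    Hypothetical: this is not Spark's QueryExecution or LogicalPlan.
    """
    description: str
    relations: FrozenSet[int]  # ids of the multi-instance relations in the plan


def scan(name: str, relation_id: int) -> CachedPlan:
    """Leaf plan reading exactly one relation instance."""
    return CachedPlan(name, frozenset({relation_id}))


def unary(plan: CachedPlan, op: str) -> CachedPlan:
    """select/filter-style operator: no new relation is introduced, so the
    cached relation set is reused and dedup can be skipped."""
    return CachedPlan(f"{op}({plan.description})", plan.relations)


def binary(left: CachedPlan, right: CachedPlan, op: str) -> Tuple[CachedPlan, bool]:
    """join/union/intersect-style operator.

    Returns the combined plan and a flag saying whether deduplication is
    needed, i.e. whether the two sides share any relation id.
    """
    needs_dedup = not left.relations.isdisjoint(right.relations)
    combined = CachedPlan(
        f"{op}({left.description}, {right.description})",
        left.relations | right.relations,
    )
    return combined, needs_dedup


a = scan("a", 1)
b = scan("b", 2)
filtered_a = unary(a, "filter")

_, dedup_disjoint = binary(filtered_a, b, "join")   # no shared relation
_, dedup_self_join = binary(filtered_a, a, "join")  # relation 1 on both sides
print(dedup_disjoint, dedup_self_join)
```

In this model the join of filtered_a with b can skip deduplication, while the 
self-join of filtered_a with a cannot, because relation 1 appears on both 
sides.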

  was:
In many cases, it has been observed that DeduplicateRelations rule, though 
essential, but by its nature has impacted query analysis time to big extent 
especially when dealing with large query plans.

It appears that in many situations we can guarantee that there would be no 
duplicate relations present and thus avoid applying the rule.

Those situations are :

1) When dataframe api's like select/filter which operate on existing dataframe, 
are used.

 

Also if we store the MultiInstanceRelations in the QueryExecution ,for  a given 
plan, then we can make use of that information , while creating new dataframes 
where there is a possibility of duplicate relations ( like join, union, 
intersection etc).

If two datasets being unioned/intersected/joined ..etc have no common 
MultiInstanceRelation , then it should be safe to assume that there is no 
possibility of duplicate relations , there by allowing the Dedup rule to skip..

 

Atleast that is the idea.


> SPIP : Improving analyzer performance by skipping DeduplicateRelations rule 
> conditionally
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-49881
>                 URL: https://issues.apache.org/jira/browse/SPARK-49881
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0, 3.5.3
>            Reporter: Asif
>            Priority: Major



