[ 
https://issues.apache.org/jira/browse/SPARK-20413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20413:
------------------------------------

    Assignee: Apache Spark

> New Optimizer Hint to prevent collapsing of adjacent projections
> ----------------------------------------------------------------
>
>                 Key: SPARK-20413
>                 URL: https://issues.apache.org/jira/browse/SPARK-20413
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Michael Styles
>            Assignee: Apache Spark
>
> I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. 
> This hint is essentially identical to Oracle's NO_MERGE hint. 
> Let me first give an example of why I am proposing this. 
> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
> df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) 
> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
> df2["ua"].browser_version.alias("c2")) 
> df3.explain(True) 
> == Parsed Logical Plan == 
> 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS 
> c2#91] 
> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
>    +- LogicalRDD [id#80L, user_agent#81] 
> == Analyzed Logical Plan == 
> c1: string, c2: string 
> Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] 
> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
>    +- LogicalRDD [id#80L, user_agent#81] 
> == Optimized Logical Plan == 
> Project [UDF(user_agent#81).device_form_factor AS c1#90, 
> UDF(user_agent#81).browser_version AS c2#91] 
> +- LogicalRDD [id#80L, user_agent#81] 
> == Physical Plan == 
> *Project [UDF(user_agent#81).device_form_factor AS c1#90, 
> UDF(user_agent#81).browser_version AS c2#91] 
> +- Scan ExistingRDD[id#80L,user_agent#81] 
> user_agent_details is a user-defined function that returns a struct. As can 
> be seen from the generated query plan, the function is being executed 
> multiple times which could lead to performance issues. This is due to the 
> CollapseProject optimizer rule that collapses adjacent projections. 
> I'm proposing a hint that prevent the optimizer from collapsing adjacent 
> projections. A new function called 'no_collapse' would be introduced for this 
> purpose. Consider the following example and generated query plan. 
> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
> df2 = F.no_collapse(df1.withColumn("ua", 
> user_agent_details(df1["user_agent"]))) 
> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
> df2["ua"].browser_version.alias("c2")) 
> df3.explain(True) 
> == Parsed Logical Plan == 
> 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS 
> c2#76] 
> +- NoCollapseHint 
>    +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
>       +- LogicalRDD [id#64L, user_agent#65] 
> == Analyzed Logical Plan == 
> c1: string, c2: string 
> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
> +- NoCollapseHint 
>    +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
>       +- LogicalRDD [id#64L, user_agent#65] 
> == Optimized Logical Plan == 
> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
> +- NoCollapseHint 
>    +- Project [UDF(user_agent#65) AS ua#69] 
>       +- LogicalRDD [id#64L, user_agent#65] 
> == Physical Plan == 
> *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
> +- *Project [UDF(user_agent#65) AS ua#69] 
>    +- Scan ExistingRDD[id#64L,user_agent#65] 
> As can be seen from the query plan, the user-defined function is now 
> evaluated once per row.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to