Doesn't common subexpression elimination address this issue as well?

On Thu, Apr 20, 2017 at 6:40 AM Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:
> Hi Michael,
>
> This sounds like a good idea. Can you open a JIRA to track this?
>
> My initial feedback on your proposal would be that you might want to
> express the no_collapse at the expression level and not at the plan level.
>
> HTH
>
> On Thu, Apr 20, 2017 at 3:31 PM, Michael Styles <
> michael.sty...@shopify.com> wrote:
>
>> Hello,
>>
>> I am in the process of putting together a PR that introduces a new hint
>> called NO_COLLAPSE. This hint is essentially identical to Oracle's NO_MERGE
>> hint.
>>
>> Let me first give an example of why I am proposing this.
>>
>> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
>> df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"]))
>> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
>>                  df2["ua"].browser_version.alias("c2"))
>> df3.explain(True)
>>
>> == Parsed Logical Plan ==
>> 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91]
>> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>>    +- LogicalRDD [id#80L, user_agent#81]
>>
>> == Analyzed Logical Plan ==
>> c1: string, c2: string
>> Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91]
>> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>>    +- LogicalRDD [id#80L, user_agent#81]
>>
>> == Optimized Logical Plan ==
>> Project [UDF(user_agent#81).device_form_factor AS c1#90,
>>          UDF(user_agent#81).browser_version AS c2#91]
>> +- LogicalRDD [id#80L, user_agent#81]
>>
>> == Physical Plan ==
>> *Project [UDF(user_agent#81).device_form_factor AS c1#90,
>>           UDF(user_agent#81).browser_version AS c2#91]
>> +- Scan ExistingRDD[id#80L,user_agent#81]
>>
>> user_agent_details is a user-defined function that returns a struct. As
>> can be seen from the generated query plan, the function is being executed
>> multiple times, which could lead to performance issues. This is due to the
>> CollapseProject optimizer rule that collapses adjacent projections.
>>
>> I'm proposing a hint that prevents the optimizer from collapsing adjacent
>> projections. A new function called 'no_collapse' would be introduced for
>> this purpose. Consider the following example and generated query plan.
>>
>> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
>> df2 = F.no_collapse(df1.withColumn("ua",
>>                     user_agent_details(df1["user_agent"])))
>> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
>>                  df2["ua"].browser_version.alias("c2"))
>> df3.explain(True)
>>
>> == Parsed Logical Plan ==
>> 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76]
>> +- NoCollapseHint
>>    +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>>       +- LogicalRDD [id#64L, user_agent#65]
>>
>> == Analyzed Logical Plan ==
>> c1: string, c2: string
>> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76]
>> +- NoCollapseHint
>>    +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>>       +- LogicalRDD [id#64L, user_agent#65]
>>
>> == Optimized Logical Plan ==
>> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76]
>> +- NoCollapseHint
>>    +- Project [UDF(user_agent#65) AS ua#69]
>>       +- LogicalRDD [id#64L, user_agent#65]
>>
>> == Physical Plan ==
>> *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76]
>> +- *Project [UDF(user_agent#65) AS ua#69]
>>    +- Scan ExistingRDD[id#64L,user_agent#65]
>>
>> As can be seen from the query plan, the user-defined function is now
>> evaluated once per row.
>>
>> I would like to get some feedback on this proposal.
>>
>> Thanks.
>
> --
> Herman van Hövell
> Software Engineer
> Databricks Inc.
> hvanhov...@databricks.com
> +31 6 420 590 27
> databricks.com
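[Editor's note: on the common-subexpression-elimination question at the top of the thread: CSE would address this case if the engine recognizes the two `UDF(user_agent)` occurrences as the same deterministic expression and evaluates it once; whether that actually happens depends on the evaluation path the UDF takes. The following toy sketch (plain Python, invented helper names, not Spark's implementation) shows the idea: memoize each distinct expression per row so the collapsed plan still calls the UDF once.]

```python
# Toy common-subexpression elimination over the collapsed projection:
# identical subexpressions are keyed and computed at most once per row.
calls = {"count": 0}

def user_agent_details(ua):
    # Stand-in for an expensive, deterministic UDF.
    calls["count"] += 1
    return {"device_form_factor": "desktop", "browser_version": ua + "/1.0"}

def evaluate_with_cse(row):
    cache = {}

    def cse(key, thunk):
        # Evaluate the subexpression only on first reference; reuse after.
        if key not in cache:
            cache[key] = thunk()
        return cache[key]

    # Both output columns reference the same subexpression key, so the
    # UDF runs once even though the projections were collapsed.
    k = "UDF(user_agent)"
    return {
        "c1": cse(k, lambda: user_agent_details(row["user_agent"]))["device_form_factor"],
        "c2": cse(k, lambda: user_agent_details(row["user_agent"]))["browser_version"],
    }

calls["count"] = 0
out = evaluate_with_cse({"id": 1, "user_agent": "abc"})
print(calls["count"])  # 1: the shared subexpression is evaluated once
```

Note that CSE relies on the expression being deterministic; a hint like NO_COLLAPSE instead shapes the plan so the question never arises.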