[ https://issues.apache.org/jira/browse/SPARK-25963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-25963:
------------------------------------

    Assignee: (was: Apache Spark)

> Optimize generate followed by window
> ------------------------------------
>
>                 Key: SPARK-25963
>                 URL: https://issues.apache.org/jira/browse/SPARK-25963
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Ohad Raviv
>            Priority: Minor
>
> We've noticed that, in our use cases, when an explode is followed by a
> window function we can almost always optimize the query by repartitioning by
> the window's partition columns before the explode. For example:
> {code:java}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.expressions._
>
> val N = 1 << 12
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=0")
> val tokens = spark.range(N).selectExpr(
>     "floor(id/4) as key", "'asd/asd/asd/asd/asd/asd' as url")
>   // .repartition($"key")  // the proposed optimization: shuffle before the explode
>   .selectExpr("*", "explode(split(url, '/')) as token")
>
> val w = Window.partitionBy("key", "token")
> val res = tokens.withColumn("cnt", count("token").over(w))
> res.explain(true)
> {code}
> {noformat}
> == Optimized Logical Plan ==
> Window [count(token#11) windowspecdefinition(key#6L, token#11,
>   specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()))
>   AS cnt#17L], [key#6L, token#11]
> +- Generate explode([asd,asd,asd,asd,asd,asd]), false, [token#11]
>    +- Project [FLOOR((cast(id#4L as double) / 4.0)) AS key#6L,
>         asd/asd/asd/asd/asd/asd AS url#7]
>       +- Range (0, 4096, step=1, splits=Some(1))
> {noformat}
> Currently, all the data is exploded in the first stage, then shuffled, and
> only then aggregated. We could perform exactly the same computation by first
> shuffling the data and then, in the second stage, exploding and aggregating.
> I have a PR that attempts to resolve this; I'm just not sure I've covered all
> the cases.
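The workaround described in the issue can also be applied by hand today, without the proposed optimizer rule. A minimal sketch, using the same schema as the example above and assuming a running Spark shell (`spark` in scope): repartitioning by `key` before the explode means the narrow one-row-per-id output of Range/Project is what gets shuffled, rather than the roughly six-times-larger exploded output. Since `key` is a subset of the window's `(key, token)` partitioning, the planner should not insert a second exchange for the window.

{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val N = 1 << 12
val base = spark.range(N).selectExpr(
  "floor(id/4) as key", "'asd/asd/asd/asd/asd/asd' as url")

// Shuffle by the window's partition key *before* exploding, so the small
// pre-explode rows cross the network instead of the exploded ones.
val tokens = base
  .repartition($"key")
  .selectExpr("*", "explode(split(url, '/')) as token")

val w = Window.partitionBy("key", "token")
tokens.withColumn("cnt", count("token").over(w)).explain(true)
{code}

In the resulting physical plan the only Exchange should appear below Generate; the Window operator reuses the existing hash partitioning on `key` because it clusters the data by a subset of `(key, token)`.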
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)