[jira] [Commented] (SEDONA-532) Sedona Spark SQL optimizer cannot optimize joins with complex conditions

Kristin Cowalcijk (Jira) Tue, 09 Apr 2024 05:14:33 -0700


    [ 
https://issues.apache.org/jira/browse/SEDONA-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835369#comment-17835369
 ]


Kristin Cowalcijk commented on SEDONA-532:
------------------------------------------

There is also a (probably) related GitHub issue: 
[https://github.com/apache/sedona/issues/855]

I cannot reproduce this issue using Sedona 1.5.1, but I suspect that it is a 
similar problem.

> Sedona Spark SQL optimizer cannot optimize joins with complex conditions
> ------------------------------------------------------------------------
>
>                 Key: SEDONA-532
>                 URL: https://issues.apache.org/jira/browse/SEDONA-532
>             Project: Apache Sedona
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: Kristin Cowalcijk
>            Priority: Major
>             Fix For: 1.6.0
>
>
> Sedona fails to optimize spatial joins with complex join conditions. Here is 
> an example:
> {code:python}
> df1 = spark.range(1, 1000).withColumn("geom", expr("ST_Point(id, id)"))
> df2 = spark.range(1, 1000).withColumn("geom", expr("ST_Buffer(ST_Point(id, 
> id), 2)"))
> df1.createOrReplaceTempView("df1")
> df2.createOrReplaceTempView("df2")
> spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) 
> AND df1.id > df2.id AND df1.id < df2.id + 10").explain()
> {code}
> This join has a spatial condition {{ST_Intersects(df1.geom, df2.goem)}}, but 
> it is planned as a broadcast nested loop join:
> {code}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastNestedLoopJoin BuildRight, Inner, (( 
> **org.apache.spark.sql.sedona_sql.expressions.ST_Intersects**   AND (id#434L 
> > id#439L)) AND (id#434L < (id#439L + 10)))
>    :- Project [id#434L,  
> **org.apache.spark.sql.sedona_sql.expressions.ST_Point**   AS geom#436]
>    :  +- Filter isnotnull( 
> **org.apache.spark.sql.sedona_sql.expressions.ST_Point**  )
>    :     +- Range (1, 1000, step=1, splits=10)
>    +- BroadcastExchange IdentityBroadcastMode, [plan_id=1157]
>       +- Project [id#439L,  
> **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**   AS geom#441]
>          +- Filter isnotnull( 
> **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**  )
>             +- Range (1, 1000, step=1, splits=10)
> {code}
> If we specify a simpler join condition, Sedona will optimize the join 
> correctly:
> {code:python}
> spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) 
> AND df1.id > df2.id").explain()
> {code}
> Output:
> {code}
> == Physical Plan ==
> BroadcastIndexJoin geom#500: geometry, LeftSide, LeftSide, Inner, INTERSECTS, 
> (id#493L > id#498L) ST_INTERSECTS(geom#495, geom#500)
> :- SpatialIndex geom#495: geometry, RTREE, false, false
> :  +- Project [id#493L,  
> **org.apache.spark.sql.sedona_sql.expressions.ST_Point**   AS geom#495]
> :     +- Filter isnotnull( 
> **org.apache.spark.sql.sedona_sql.expressions.ST_Point**  )
> :        +- *(1) Range (1, 1000, step=1, splits=10)
> +- Project [id#498L,  
> **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**   AS geom#500]
>    +- Filter isnotnull( 
> **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**  )
>       +- *(2) Range (1, 1000, step=1, splits=10)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (SEDONA-532) Sedona Spark SQL optimizer cannot optimize joins with complex conditions

Reply via email to