Kristin Cowalcijk created SEDONA-532:
----------------------------------------

             Summary: Sedona Spark SQL optimizer cannot optimize joins with 
complex conditions
                 Key: SEDONA-532
                 URL: https://issues.apache.org/jira/browse/SEDONA-532
             Project: Apache Sedona
          Issue Type: Bug
    Affects Versions: 1.5.1
            Reporter: Kristin Cowalcijk
             Fix For: 1.6.0


Sedona fails to optimize spatial joins with complex join conditions. Here is an 
example:

{code:python}
df1 = spark.range(1, 1000).withColumn("geom", expr("ST_Point(id, id)"))
df2 = spark.range(1, 1000).withColumn("geom", expr("ST_Buffer(ST_Point(id, id), 
2)"))
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND 
df1.id > df2.id AND df1.id < df2.id + 10").explain()
{code}

This join has a spatial condition {{ST_Intersects(df1.geom, df2.goem)}}, but it 
is planned as a broadcast nested loop join:

{code}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastNestedLoopJoin BuildRight, Inner, (( 
**org.apache.spark.sql.sedona_sql.expressions.ST_Intersects**   AND (id#434L > 
id#439L)) AND (id#434L < (id#439L + 10)))
   :- Project [id#434L,  
**org.apache.spark.sql.sedona_sql.expressions.ST_Point**   AS geom#436]
   :  +- Filter isnotnull( 
**org.apache.spark.sql.sedona_sql.expressions.ST_Point**  )
   :     +- Range (1, 1000, step=1, splits=10)
   +- BroadcastExchange IdentityBroadcastMode, [plan_id=1157]
      +- Project [id#439L,  
**org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**   AS geom#441]
         +- Filter isnotnull( 
**org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**  )
            +- Range (1, 1000, step=1, splits=10)
{code}

If we specify a simpler join condition, Sedona will optimize the join correctly:

{code:python}
spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND 
df1.id > df2.id").explain()
{code}

Output:

{code}
== Physical Plan ==
BroadcastIndexJoin geom#500: geometry, LeftSide, LeftSide, Inner, INTERSECTS, 
(id#493L > id#498L) ST_INTERSECTS(geom#495, geom#500)
:- SpatialIndex geom#495: geometry, RTREE, false, false
:  +- Project [id#493L,  
**org.apache.spark.sql.sedona_sql.expressions.ST_Point**   AS geom#495]
:     +- Filter isnotnull( 
**org.apache.spark.sql.sedona_sql.expressions.ST_Point**  )
:        +- *(1) Range (1, 1000, step=1, splits=10)
+- Project [id#498L,  **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** 
  AS geom#500]
   +- Filter isnotnull( 
**org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**  )
      +- *(2) Range (1, 1000, step=1, splits=10)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to