Kristin Cowalcijk created SEDONA-532: ----------------------------------------
Summary: Sedona Spark SQL optimizer cannot optimize joins with complex conditions Key: SEDONA-532 URL: https://issues.apache.org/jira/browse/SEDONA-532 Project: Apache Sedona Issue Type: Bug Affects Versions: 1.5.1 Reporter: Kristin Cowalcijk Fix For: 1.6.0 Sedona fails to optimize spatial joins with complex join conditions. Here is an example: {code:python} df1 = spark.range(1, 1000).withColumn("geom", expr("ST_Point(id, id)")) df2 = spark.range(1, 1000).withColumn("geom", expr("ST_Buffer(ST_Point(id, id), 2)")) df1.createOrReplaceTempView("df1") df2.createOrReplaceTempView("df2") spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND df1.id > df2.id AND df1.id < df2.id + 10").explain() {code} This join has a spatial condition {{ST_Intersects(df1.geom, df2.goem)}}, but it is planned as a broadcast nested loop join: {code} == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- BroadcastNestedLoopJoin BuildRight, Inner, (( **org.apache.spark.sql.sedona_sql.expressions.ST_Intersects** AND (id#434L > id#439L)) AND (id#434L < (id#439L + 10))) :- Project [id#434L, **org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#436] : +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Point** ) : +- Range (1, 1000, step=1, splits=10) +- BroadcastExchange IdentityBroadcastMode, [plan_id=1157] +- Project [id#439L, **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** AS geom#441] +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** ) +- Range (1, 1000, step=1, splits=10) {code} If we specify a simpler join condition, Sedona will optimize the join correctly: {code:python} spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND df1.id > df2.id").explain() {code} Output: {code} == Physical Plan == BroadcastIndexJoin geom#500: geometry, LeftSide, LeftSide, Inner, INTERSECTS, (id#493L > id#498L) ST_INTERSECTS(geom#495, geom#500) :- SpatialIndex geom#495: geometry, RTREE, false, false : +- Project [id#493L, **org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#495] : +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Point** ) : +- *(1) Range (1, 1000, step=1, splits=10) +- Project [id#498L, **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** AS geom#500] +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** ) +- *(2) Range (1, 1000, step=1, splits=10) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)