[ https://issues.apache.org/jira/browse/SEDONA-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835369#comment-17835369 ]
Kristin Cowalcijk commented on SEDONA-532: ------------------------------------------ There is also a (probably) related GitHub issue: [https://github.com/apache/sedona/issues/855] I cannot reproduce this issue using Sedona 1.5.1, but I suspect that it is a similar problem. > Sedona Spark SQL optimizer cannot optimize joins with complex conditions > ------------------------------------------------------------------------ > > Key: SEDONA-532 > URL: https://issues.apache.org/jira/browse/SEDONA-532 > Project: Apache Sedona > Issue Type: Bug > Affects Versions: 1.5.1 > Reporter: Kristin Cowalcijk > Priority: Major > Fix For: 1.6.0 > > > Sedona fails to optimize spatial joins with complex join conditions. Here is > an example: > {code:python} > df1 = spark.range(1, 1000).withColumn("geom", expr("ST_Point(id, id)")) > df2 = spark.range(1, 1000).withColumn("geom", expr("ST_Buffer(ST_Point(id, > id), 2)")) > df1.createOrReplaceTempView("df1") > df2.createOrReplaceTempView("df2") > spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) > AND df1.id > df2.id AND df1.id < df2.id + 10").explain() > {code} > This join has a spatial condition {{ST_Intersects(df1.geom, df2.goem)}}, but > it is planned as a broadcast nested loop join: > {code} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- BroadcastNestedLoopJoin BuildRight, Inner, (( > **org.apache.spark.sql.sedona_sql.expressions.ST_Intersects** AND (id#434L > > id#439L)) AND (id#434L < (id#439L + 10))) > :- Project [id#434L, > **org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#436] > : +- Filter isnotnull( > **org.apache.spark.sql.sedona_sql.expressions.ST_Point** ) > : +- Range (1, 1000, step=1, splits=10) > +- BroadcastExchange IdentityBroadcastMode, [plan_id=1157] > +- Project [id#439L, > **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** AS geom#441] > +- Filter isnotnull( > **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** ) > +- Range (1, 1000, step=1, splits=10) > {code} > If we specify a simpler join condition, Sedona will optimize the join > correctly: > {code:python} > spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) > AND df1.id > df2.id").explain() > {code} > Output: > {code} > == Physical Plan == > BroadcastIndexJoin geom#500: geometry, LeftSide, LeftSide, Inner, INTERSECTS, > (id#493L > id#498L) ST_INTERSECTS(geom#495, geom#500) > :- SpatialIndex geom#495: geometry, RTREE, false, false > : +- Project [id#493L, > **org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#495] > : +- Filter isnotnull( > **org.apache.spark.sql.sedona_sql.expressions.ST_Point** ) > : +- *(1) Range (1, 1000, step=1, splits=10) > +- Project [id#498L, > **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** AS geom#500] > +- Filter isnotnull( > **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** ) > +- *(2) Range (1, 1000, step=1, splits=10) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)