[
https://issues.apache.org/jira/browse/SEDONA-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835369#comment-17835369
]
Kristin Cowalcijk commented on SEDONA-532:
------------------------------------------
There is also a (probably) related GitHub issue:
[https://github.com/apache/sedona/issues/855]
I cannot reproduce this issue using Sedona 1.5.1, but I suspect that it is a
similar problem.
> Sedona Spark SQL optimizer cannot optimize joins with complex conditions
> ------------------------------------------------------------------------
>
> Key: SEDONA-532
> URL: https://issues.apache.org/jira/browse/SEDONA-532
> Project: Apache Sedona
> Issue Type: Bug
> Affects Versions: 1.5.1
> Reporter: Kristin Cowalcijk
> Priority: Major
> Fix For: 1.6.0
>
>
> Sedona fails to optimize spatial joins with complex join conditions. Here is
> an example:
> {code:python}
> df1 = spark.range(1, 1000).withColumn("geom", expr("ST_Point(id, id)"))
> df2 = spark.range(1, 1000).withColumn("geom", expr("ST_Buffer(ST_Point(id,
> id), 2)"))
> df1.createOrReplaceTempView("df1")
> df2.createOrReplaceTempView("df2")
> spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom)
> AND df1.id > df2.id AND df1.id < df2.id + 10").explain()
> {code}
> This join has a spatial condition {{ST_Intersects(df1.geom, df2.goem)}}, but
> it is planned as a broadcast nested loop join:
> {code}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastNestedLoopJoin BuildRight, Inner, ((
> **org.apache.spark.sql.sedona_sql.expressions.ST_Intersects** AND (id#434L
> > id#439L)) AND (id#434L < (id#439L + 10)))
> :- Project [id#434L,
> **org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#436]
> : +- Filter isnotnull(
> **org.apache.spark.sql.sedona_sql.expressions.ST_Point** )
> : +- Range (1, 1000, step=1, splits=10)
> +- BroadcastExchange IdentityBroadcastMode, [plan_id=1157]
> +- Project [id#439L,
> **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** AS geom#441]
> +- Filter isnotnull(
> **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** )
> +- Range (1, 1000, step=1, splits=10)
> {code}
> If we specify a simpler join condition, Sedona will optimize the join
> correctly:
> {code:python}
> spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom)
> AND df1.id > df2.id").explain()
> {code}
> Output:
> {code}
> == Physical Plan ==
> BroadcastIndexJoin geom#500: geometry, LeftSide, LeftSide, Inner, INTERSECTS,
> (id#493L > id#498L) ST_INTERSECTS(geom#495, geom#500)
> :- SpatialIndex geom#495: geometry, RTREE, false, false
> : +- Project [id#493L,
> **org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#495]
> : +- Filter isnotnull(
> **org.apache.spark.sql.sedona_sql.expressions.ST_Point** )
> : +- *(1) Range (1, 1000, step=1, splits=10)
> +- Project [id#498L,
> **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** AS geom#500]
> +- Filter isnotnull(
> **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** )
> +- *(2) Range (1, 1000, step=1, splits=10)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)