Kristin Cowalcijk created SEDONA-532:
----------------------------------------
Summary: Sedona Spark SQL optimizer cannot optimize joins with
complex conditions
Key: SEDONA-532
URL: https://issues.apache.org/jira/browse/SEDONA-532
Project: Apache Sedona
Issue Type: Bug
Affects Versions: 1.5.1
Reporter: Kristin Cowalcijk
Fix For: 1.6.0
Sedona fails to optimize spatial joins with complex join conditions. Here is an
example:
{code:python}
df1 = spark.range(1, 1000).withColumn("geom", expr("ST_Point(id, id)"))
df2 = spark.range(1, 1000).withColumn("geom", expr("ST_Buffer(ST_Point(id, id),
2)"))
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND
df1.id > df2.id AND df1.id < df2.id + 10").explain()
{code}
This join has a spatial condition {{ST_Intersects(df1.geom, df2.goem)}}, but it
is planned as a broadcast nested loop join:
{code}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastNestedLoopJoin BuildRight, Inner, ((
**org.apache.spark.sql.sedona_sql.expressions.ST_Intersects** AND (id#434L >
id#439L)) AND (id#434L < (id#439L + 10)))
:- Project [id#434L,
**org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#436]
: +- Filter isnotnull(
**org.apache.spark.sql.sedona_sql.expressions.ST_Point** )
: +- Range (1, 1000, step=1, splits=10)
+- BroadcastExchange IdentityBroadcastMode, [plan_id=1157]
+- Project [id#439L,
**org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** AS geom#441]
+- Filter isnotnull(
**org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** )
+- Range (1, 1000, step=1, splits=10)
{code}
If we specify a simpler join condition, Sedona will optimize the join correctly:
{code:python}
spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND
df1.id > df2.id").explain()
{code}
Output:
{code}
== Physical Plan ==
BroadcastIndexJoin geom#500: geometry, LeftSide, LeftSide, Inner, INTERSECTS,
(id#493L > id#498L) ST_INTERSECTS(geom#495, geom#500)
:- SpatialIndex geom#495: geometry, RTREE, false, false
: +- Project [id#493L,
**org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#495]
: +- Filter isnotnull(
**org.apache.spark.sql.sedona_sql.expressions.ST_Point** )
: +- *(1) Range (1, 1000, step=1, splits=10)
+- Project [id#498L, **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**
AS geom#500]
+- Filter isnotnull(
**org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** )
+- *(2) Range (1, 1000, step=1, splits=10)
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)