Martin Andersson created SEDONA-641:
---------------------------------------
Summary: Add optimizer for non broadcast outer joins
Key: SEDONA-641
URL: https://issues.apache.org/jira/browse/SEDONA-641
Project: Apache Sedona
Issue Type: Improvement
Reporter: Martin Andersson
Sedona doesn't optimize non broadcast outer joins. That has some surprising
effects for inexperienced Sedona users.
* Changing a join from inner to outer makes Spark fall back to a nested loop -
grinding to a halt.
* If an automatically broadcasted outer join grows above the broadcast
threashold the plan changes from a Sedona optimized broadcast join to a Spark
nested loop.
To work around the lack of outer joins they have to be implemented by the user.
Usually by adding a uuid to the datasets, join geospatially with an inner join
and then join with the datasets again using an outer join with the uuids. The
user implemented outer join obfuscates the business logic.
Implementing outer join as efficiently as inner join is hard. It might be worth
adding an optimizer for outer joins even if it's not ideal. It could be
implemented with uuids, an inner geospatial join and an outer join using the
uuids. That would be significantly slower than the inner join optimizer but
would be way better than Sparks nested loop. If I remember correctly this is
how an early version of geospark did deduplication before the current
deduplication logic was implemented.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)