I've noticed that performing joins with the DataFrame API tends to be significantly slower than using the SpatialRDD API directly. To illustrate, I've put together a simple benchmark, which generates 10k points and 10k envelopes at random, then counts the number of envelope/point pairs such that the point is contained in the envelope: https://gist.github.com/agbrooks/3f82bc7894e931e93a3d8de0a16cfba0
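
For reference, the quantity the benchmark counts can be sketched as a brute-force check in plain Scala, with no Spark or Sedona involved. This is not the code from the gist: `Point`, `Envelope`, `BruteForceBaseline`, the unit-square coordinate range, the seed, and the boundary-inclusive containment test are all assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical stand-ins for the benchmark's geometries (not Sedona or JTS types).
final case class Point(x: Double, y: Double)

final case class Envelope(minX: Double, maxX: Double, minY: Double, maxY: Double) {
  // Boundary-inclusive containment; an assumption, since the real predicate
  // may treat points on the boundary differently.
  def contains(p: Point): Boolean =
    p.x >= minX && p.x <= maxX && p.y >= minY && p.y <= maxY
}

object BruteForceBaseline {
  // Uniform random point in the unit square (assumed coordinate range).
  def randomPoint(rng: Random): Point =
    Point(rng.nextDouble(), rng.nextDouble())

  // Random envelope spanned by two uniform random corners in the unit square.
  def randomEnvelope(rng: Random): Envelope = {
    val (x1, x2) = (rng.nextDouble(), rng.nextDouble())
    val (y1, y2) = (rng.nextDouble(), rng.nextDouble())
    Envelope(math.min(x1, x2), math.max(x1, x2), math.min(y1, y2), math.max(y1, y2))
  }

  // O(n * m) count of (envelope, point) pairs where the envelope contains the point.
  def countContained(envelopes: Seq[Envelope], points: Seq[Point]): Long =
    envelopes.iterator.map(e => points.count(e.contains).toLong).sum

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L) // fixed seed so repeated runs count the same pairs
    val points = Vector.fill(10000)(randomPoint(rng))
    val envelopes = Vector.fill(10000)(randomEnvelope(rng))
    println(countContained(envelopes, points))
  }
}
```

Feeding the same generated inputs to both the DataFrame and SpatialRDD implementations and comparing their counts against a baseline like this would help confirm that the two are computing the same join before their timings are compared.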
On my laptop, the DataFrame-based implementation in this benchmark takes roughly 10 times as long to execute as the SpatialRDD-based implementation (536 vs. 53 seconds).

Is the performance discrepancy caused by misuse of the API, some inherent limitation of the DataFrame-based API, a Sedona bug, or something else entirely?

If it's relevant, I'm running with Scala 2.12.13 / Spark 3.0.2 and using the latest commit on the Sedona master branch.

Best regards,
Andrew Brooks
