DataFrame joins much slower than SpatialRDD joins

Andrew Brooks Tue, 13 Apr 2021 16:20:35 -0700

I've noticed that performing joins with the DataFrame API tends to be 
significantly slower than using the SpatialRDD API directly. To illustrate, 
I've put together a simple benchmark, which generates 10k points and 10k 
envelopes at random, then counts the number of envelope/point pairs such that 
the point is contained in the envelope: 
https://gist.github.com/agbrooks/3f82bc7894e931e93a3d8de0a16cfba0
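
For readers who do not want to open the gist, the quantity being measured can 
be sketched as a brute-force count in plain Scala, with no Sedona or Spark 
involved. This is not the code from the gist. The Point and Envelope types, the 
unit-square coordinate range, the fixed seed, and the boundary-inclusive 
containment test are all assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical stand-ins for the geometry types; the actual benchmark uses
// Sedona geometries rather than these case classes.
final case class Point(x: Double, y: Double)

final case class Envelope(minX: Double, maxX: Double, minY: Double, maxY: Double) {
  // Boundary-inclusive containment (an assumption; the benchmark's predicate
  // may treat points on the boundary differently).
  def contains(p: Point): Boolean =
    p.x >= minX && p.x <= maxX && p.y >= minY && p.y <= maxY
}

object BruteForceJoin {
  // Uniform random points in the unit square (assumed range).
  def randomPoints(n: Int, rng: Random): Vector[Point] =
    Vector.fill(n)(Point(rng.nextDouble(), rng.nextDouble()))

  // Random envelopes built from two random corners in the unit square.
  def randomEnvelopes(n: Int, rng: Random): Vector[Envelope] =
    Vector.fill(n) {
      val (x1, x2) = (rng.nextDouble(), rng.nextDouble())
      val (y1, y2) = (rng.nextDouble(), rng.nextDouble())
      Envelope(math.min(x1, x2), math.max(x1, x2), math.min(y1, y2), math.max(y1, y2))
    }

  // Count (envelope, point) pairs such that the envelope contains the point.
  def countContainedPairs(envelopes: Seq[Envelope], points: Seq[Point]): Long =
    envelopes.iterator.map(e => points.count(e.contains).toLong).sum

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val points = randomPoints(10000, rng)
    val envelopes = randomEnvelopes(10000, rng)
    println(countContainedPairs(envelopes, points))
  }
}
```

Both the DataFrame and SpatialRDD implementations should arrive at the same 
kind of pair count for a given input; the question is only how long each takes 
to get there.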




On my laptop, the DataFrame-based implementation in this benchmark takes 
roughly 10 times as long to execute as the SpatialRDD-based implementation 
(536 vs. 53 seconds).



Is the performance discrepancy caused by misuse of the API, some inherent 
limitation of the DataFrame-based API, a Sedona bug, or something else entirely?



If it's relevant, I'm running with Scala 2.12.13 / Spark 3.0.2 and using the 
latest commit on the Sedona master branch.



Best regards,

Andrew Brooks
