Re: DataFrame joins much slower than SpatialRDD joins

Adam Binford Wed, 14 Apr 2021 03:49:43 -0700

Are you using the 1.0.0 release? If so, it has a bug that prevents
spatial indexing from being used in SQL join queries, which hopefully
explains the difference. Broadcast join support is also on the way,
which could make the SQL join even faster than the RDD join for
large-small joins.


Adam

On Tue, Apr 13, 2021 at 7:20 PM Andrew Brooks <[email protected]>
wrote:

> I've noticed that performing joins with the DataFrame API tends to be
> significantly slower than using the SpatialRDD API directly. To illustrate,
> I've put together a simple benchmark, which generates 10k points and 10k
> envelopes at random, then counts the number of envelope/point pairs such
> that the point is contained in the envelope:
> https://gist.github.com/agbrooks/3f82bc7894e931e93a3d8de0a16cfba0
>
> On my laptop, the DataFrame-based implementation in this benchmark takes
> nearly 10 times as long to execute as the SpatialRDD-based implementation
> (536 vs. 53 seconds).
>
> Is the performance discrepancy caused by misuse of the API, some inherent
> limitation of the DataFrame-based API, a Sedona bug, or something else
> entirely?
>
> If it's relevant, I'm running with Scala 2.12.13 / Spark 3.0.2 and using
> the latest commit on the Sedona master branch.
>
> Best regards,
>
> Andrew Brooks
>
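
For checking that both implementations return the same answer, the
computation described in the quoted message can be sketched in plain Scala
as a brute-force reference count. This is only a sketch and not the code
from the gist: the names Point, Envelope, and BruteForceCount are
hypothetical, coordinates are assumed to be uniform in the unit square, and
points on an envelope's boundary are assumed to count as contained.

```scala
import scala.util.Random

// Hypothetical types for illustration; these are not Sedona or JTS classes.
final case class Point(x: Double, y: Double)

final case class Envelope(minX: Double, minY: Double, maxX: Double, maxY: Double) {
  // Assumption: a point on the boundary counts as contained.
  def contains(p: Point): Boolean =
    p.x >= minX && p.x <= maxX && p.y >= minY && p.y <= maxY
}

object BruteForceCount {
  // Assumption: coordinates are drawn uniformly from the unit square.
  def randomPoints(n: Int, rng: Random): Vector[Point] =
    Vector.fill(n)(Point(rng.nextDouble(), rng.nextDouble()))

  def randomEnvelopes(n: Int, rng: Random): Vector[Envelope] =
    Vector.fill(n) {
      val (x1, x2) = (rng.nextDouble(), rng.nextDouble())
      val (y1, y2) = (rng.nextDouble(), rng.nextDouble())
      // Order the corners so that min <= max on both axes.
      Envelope(math.min(x1, x2), math.min(y1, y2), math.max(x1, x2), math.max(y1, y2))
    }

  // Checks every envelope against every point: O(n * m), no spatial index.
  def countPairs(envelopes: Seq[Envelope], points: Seq[Point]): Long =
    envelopes.iterator.map(e => points.count(e.contains).toLong).sum

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L) // fixed seed so repeated runs are comparable
    val points = randomPoints(10000, rng)
    val envelopes = randomEnvelopes(10000, rng)
    println(countPairs(envelopes, points))
  }
}
```

Feeding the same generated inputs to the DataFrame join and the SpatialRDD
join and comparing each count against this one separates correctness
questions from the timing question.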


-- 
Adam Binford
