Hi folks, Some thoughts regarding this.
1. DataFrame Geom serializer was changed to WKB serializer from Sedona original SHAPE serializer, in 1.0.0 [1]. Based on our test, the old serializer was several times faster than the WKB serializer [2]. We are working on a PR to support both SHAPE serializer and WKB serializer [3]. But Sedona RDD API still uses the SHAPE serializer. Serializer is used intensively in join. 2. DataFrame Scala / Java API automatically does spatial partitioning and builds an index on one side of the join (Python API 1.0.0 has a bug that does not use index) . RDD API requires the user to do so. But in the code from Andrew, he only does spatial partitioning on RDD join without indexing. On a small dataset used in such case, building index may take more time than the version without index. Maybe Andrew could try to build index in RDD join as well. I believe the time difference will be decreased. Thanks, Jia [1] https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/sedona/sql/utils/GeometrySerializer.scala#L37 [2] https://gist.github.com/netanel246/f85777761ebfc0a5ddef54170ea62f11 [3] https://github.com/apache/incubator-sedona/pull/516 On Wed, Apr 14, 2021 at 3:49 AM Adam Binford <[email protected]> wrote: > Are you using the 1.0.0 release? If so, there's a bug that prevented > spatial indexing from being used in SQL join queries, which hopefully > explains the difference. Also, there will be broadcast join support too > which could make the SQL join even faster than RDD join for large-small > joins. > > Adam > > On Tue, Apr 13, 2021 at 7:20 PM Andrew Brooks < > [email protected]> > wrote: > > > I've noticed that performing joins with the DataFrame API tends to be > > significantly slower than using the SpatialRDD API directly. To > illustrate, > > I've put together a simple benchmark, which generates 10k points and 10k > > envelopes at random, then counts the number of envelope/point pairs such > > that the point is contained in the envelope: > > https://gist.github.com/agbrooks/3f82bc7894e931e93a3d8de0a16cfba0 > > > > > > > > On my laptop, the DataFrame-based implementation in this benchmark takes > > nearly 10 times as long to execute as the SpatialRDD-based implementation > > (536 vs. 53 seconds). > > > > > > > > Is the performance discrepancy caused by misuse of the API, some inherent > > limitation of the DataFrame-based API, a Sedona bug, or something else > > entirely? > > > > > > > > If it's relevant, I'm running with Scala 2.12.13 / Spark 3.0.2 and using > > the latest commit on the Sedona master branch. > > > > > > > > Best regards, > > > > Andrew Brooks > > > > > -- > Adam Binford >
