Re: DataFrame joins much slower than SpatialRDD joins

Jia Yu Wed, 14 Apr 2021 16:10:29 -0700

Hi folks,

Some thoughts regarding this.

1. DataFrame Geom serializer was changed to WKB serializer from Sedona
original SHAPE serializer, in 1.0.0 [1]. Based on our test, the old
serializer was several times faster than the WKB serializer [2]. We are
working on a PR to support both SHAPE serializer and WKB serializer [3].
But Sedona RDD API still uses the SHAPE serializer. Serializer is used
intensively in join.
2. DataFrame Scala / Java API automatically does spatial partitioning and
builds an index on one side of the join (Python API 1.0.0 has a bug that
does not use index) . RDD API requires the user to do so. But in the code
from Andrew, he only does spatial partitioning on RDD join without
indexing. On a small dataset used in such case, building index may take
more time than the version without index. Maybe Andrew could try to build
index in RDD join as well. I believe the time difference will be decreased.

Thanks,
Jia

[1]
https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/sedona/sql/utils/GeometrySerializer.scala#L37
[2] https://gist.github.com/netanel246/f85777761ebfc0a5ddef54170ea62f11
[3] https://github.com/apache/incubator-sedona/pull/516

On Wed, Apr 14, 2021 at 3:49 AM Adam Binford <[email protected]> wrote:

> Are you using the 1.0.0 release? If so, there's a bug that prevented
> spatial indexing from being used in SQL join queries, which hopefully
> explains the difference. Also, there will be broadcast join support too
> which could make the SQL join even faster than RDD join for large-small
> joins.
>
> Adam
>
> On Tue, Apr 13, 2021 at 7:20 PM Andrew Brooks <
> [email protected]>
> wrote:
>
> > I've noticed that performing joins with the DataFrame API tends to be
> > significantly slower than using the SpatialRDD API directly. To
> illustrate,
> > I've put together a simple benchmark, which generates 10k points and 10k
> > envelopes at random, then counts the number of envelope/point pairs such
> > that the point is contained in the envelope:
> > https://gist.github.com/agbrooks/3f82bc7894e931e93a3d8de0a16cfba0
> >
> >
> >
> > On my laptop, the DataFrame-based implementation in this benchmark takes
> > nearly 10 times as long to execute as the SpatialRDD-based implementation
> > (536 vs. 53 seconds).
> >
> >
> >
> > Is the performance discrepancy caused by misuse of the API, some inherent
> > limitation of the DataFrame-based API, a Sedona bug, or something else
> > entirely?
> >
> >
> >
> > If it's relevant, I'm running with Scala 2.12.13 / Spark 3.0.2 and using
> > the latest commit on the Sedona master branch.
> >
> >
> >
> > Best regards,
> >
> > Andrew Brooks
> >
>
>
> --
> Adam Binford
>

Re: DataFrame joins much slower than SpatialRDD joins

Reply via email to