Hi Pietro,

A few tips to optimize your join:

1. Mix DF and RDD together and use RDD API for the join part. See the
example here:
https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb

2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to use
a large number of partitions (say 1000 or more)

If this approach doesn't work, consider broadcast join if needed. Broadcast
the polygon side:
https://sedona.apache.org/api/sql/Optimizer/#broadcast-join

Thanks,
Jia


On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p.grese...@gmail.com>
wrote:

> To whom it may concern,
>
> we reported the following Sedona behaviour and would like to ask your
> opinion on how we can otpimize it.
>
> Our aim is to perform a inner spatial join between a points_df and a
> polygon_df when a point in points_df is contained in a polygon from
> polygons_df.
> Below you can find more details about the 2 dataframes we are considering:
> - points_df: it contains 50mln events with latitude and longitude
> approximated to the third decimal digit;
> - polygon_df: it contains 10k multi-polygons having 600 vertexes on
> average.
>
> The issue we are reporting is a very long computing time and the spatial
> join query never completing even when running on cluster with 40 workers
> with 4 cores each.
> No error is being print by driver but we are receiving the following
> warning:
> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
> but no index exists. Will build index on the fly.
>
> Actually we were able to run successfully the same spatial join when only
> considering a very small sample of events.
> Do you have any suggestion on how we can archive the same result on higher
> volumes of data or if there is a way we can optimize the join?
>
> Attached you can find the pseudo-code we are running.
>
> Looking forward to hearing from you.
>
> Kind regards,
> Pietro Greselin
>

Reply via email to