Hi Pietro, A few tips to optimize your join:
1. Mix DF and RDD together and use RDD API for the join part. See the example here: https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb 2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to use a large number of partitions (say 1000 or more) If this approach doesn't work, consider broadcast join if needed. Broadcast the polygon side: https://sedona.apache.org/api/sql/Optimizer/#broadcast-join Thanks, Jia On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p.grese...@gmail.com> wrote: > To whom it may concern, > > we reported the following Sedona behaviour and would like to ask your > opinion on how we can otpimize it. > > Our aim is to perform a inner spatial join between a points_df and a > polygon_df when a point in points_df is contained in a polygon from > polygons_df. > Below you can find more details about the 2 dataframes we are considering: > - points_df: it contains 50mln events with latitude and longitude > approximated to the third decimal digit; > - polygon_df: it contains 10k multi-polygons having 600 vertexes on > average. > > The issue we are reporting is a very long computing time and the spatial > join query never completing even when running on cluster with 40 workers > with 4 cores each. > No error is being print by driver but we are receiving the following > warning: > WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true, > but no index exists. Will build index on the fly. > > Actually we were able to run successfully the same spatial join when only > considering a very small sample of events. > Do you have any suggestion on how we can archive the same result on higher > volumes of data or if there is a way we can optimize the join? > > Attached you can find the pseudo-code we are running. > > Looking forward to hearing from you. > > Kind regards, > Pietro Greselin >