Re: Spatial join performances

Jia Yu Tue, 03 Aug 2021 01:25:21 -0700

Hi Pietro,

Can you please share the full stacktrace of this scala.MatchError? I tried
a couple test cases but wasn't able to reproduce this error on my end. In
fact, another user complained about the same issue a while back. I suspect
there is a bug for this part.


I also CCed the contributor of Sedona broadcast join. @[email protected]
<[email protected]> Hi Adam, do you have any idea about this issue?

Thanks,
Jia

On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <[email protected]>
wrote:

> Hello Jia,
>
> thank you so much for your support.
>
> We have been able to complete our task and to perform a few runs with
> different number of partitions.
> At the moment we obtained the best performance when running on 20 nodes
> and setting the number of partitions to be 2000. With this configuration,
> it took approximately 45 minutes to write the join's output.
>
> Then we tried to perform the same join through broadcast as you suggested
> to see whether we could achieve better results but actually we obtained the
> following error when calling an action like broadcast_join.show() on the
> output
>
> Py4JJavaError: An error occurred while calling o699.showString.
> : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE, 
> [id=#3312]
>
> We would be grateful if you can support us on this.
>
>
> The broadcast join was performed as follows: broadcast_join = 
> points_df.alias('points').join(f.broadcast(polygons_df).alias('polygons'), 
> f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
>
> Attached you can find the pseudo code we used to test broadcast join.
>
>
> Looking forward to hearing from you.
>
>
> Kind regards,
>
> Pietro Greselin
>
>
> On Wed, 28 Jul 2021 at 00:46, Jia Yu <[email protected]> wrote:
>
>> Hi Pietro,
>>
>> A few tips to optimize your join:
>>
>> 1. Mix DF and RDD together and use RDD API for the join part. See the
>> example here:
>> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
>>
>> 2. When use point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to
>> use a large number of partitions (say 1000 or more)
>>
>> If this approach doesn't work, consider broadcast join if needed.
>> Broadcast the polygon side:
>> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
>>
>> Thanks,
>> Jia
>>
>>
>> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <[email protected]>
>> wrote:
>>
>>> To whom it may concern,
>>>
>>> we reported the following Sedona behaviour and would like to ask your
>>> opinion on how we can otpimize it.
>>>
>>> Our aim is to perform a inner spatial join between a points_df and a
>>> polygon_df when a point in points_df is contained in a polygon from
>>> polygons_df.
>>> Below you can find more details about the 2 dataframes we are
>>> considering:
>>> - points_df: it contains 50mln events with latitude and longitude
>>> approximated to the third decimal digit;
>>> - polygon_df: it contains 10k multi-polygons having 600 vertexes on
>>> average.
>>>
>>> The issue we are reporting is a very long computing time and the spatial
>>> join query never completing even when running on cluster with 40 workers
>>> with 4 cores each.
>>> No error is being print by driver but we are receiving the following
>>> warning:
>>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
>>> but no index exists. Will build index on the fly.
>>>
>>> Actually we were able to run successfully the same spatial join when
>>> only considering a very small sample of events.
>>> Do you have any suggestion on how we can archive the same result on
>>> higher volumes of data or if there is a way we can optimize the join?
>>>
>>> Attached you can find the pseudo-code we are running.
>>>
>>> Looking forward to hearing from you.
>>>
>>> Kind regards,
>>> Pietro Greselin
>>>
>>

Re: Spatial join performances

Reply via email to