Re: Spatial join performances

Jia Yu Wed, 04 Aug 2021 18:02:15 -0700

Hi Adam,

I believe the issue is caused by this chunk of code:
https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala#L84-L109


If we move the broadcast join detection to the first part of the detector
and put all other join detection in the "else" branch, will that fix the issue?

if (broadcast XXX) {
  broadcast join detection
} else {
  all other join detection
}
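
To make the proposed ordering concrete, here is a minimal sketch in plain
Python. The JoinCandidate class, its fields, the predicate groups, and the
strategy labels are all hypothetical stand-ins for illustration, not Sedona's
actual classes; the real detector is the Scala code linked above.

```python
from dataclasses import dataclass


@dataclass
class JoinCandidate:
    """Hypothetical stand-in for a join node seen by the detector."""
    predicate: str                # e.g. "ST_Contains"
    left_broadcast: bool = False  # True if the left side has a broadcast hint
    right_broadcast: bool = False


# Hypothetical predicate group, for illustration only.
RANGE_PREDICATES = {"ST_Contains", "ST_Intersects", "ST_Within"}


def detect_join(join: JoinCandidate) -> str:
    """Return a label for the join strategy this toy detector would pick."""
    if join.left_broadcast or join.right_broadcast:
        # Broadcast detection runs first, so a broadcast hint always wins.
        return "broadcast"
    else:
        # All other join detection only runs when no side is broadcast.
        if join.predicate in RANGE_PREDICATES:
            return "range"
        if join.predicate == "ST_Distance":
            return "distance"
        return "none"
```

With this ordering, a join with a broadcast hint on either side is classified
as a broadcast join before any of the other checks are reached.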

Thanks,
Jia

On Tue, Aug 3, 2021 at 11:19 AM Adam Binford <[email protected]> wrote:

> Okay I actually did encounter it today. It happens when you have AQE
> enabled. Looked into it a little bit and might have to rework the
> SpatialIndexExec node to extend BroadcastExchangeLike or maybe even
> directly BroadcastExchangeExec, but that might only be compatible with
> Spark 3+, so not sure what to do about that. I'm not sure if there are
> specific AQE rules or optimizations that can be disabled to get it to work,
> but if you just disable it completely it should work for now. I'm also not
> at all familiar with the inner workings of AQE to know what the right way
> to properly work with that is.
>
> Adam
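
For anyone hitting the same error, the workaround described above (turning AQE
off completely) can be sketched as follows. spark.sql.adaptive.enabled is the
standard Spark SQL setting that controls AQE; the helper name disable_aqe is
hypothetical, and `spark` is assumed to be an already created SparkSession.

```python
def disable_aqe(spark):
    """Turn off Adaptive Query Execution (AQE) for the given session.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    # Runtime SQL configuration; affects queries planned after this call.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    return spark
```

Calling disable_aqe(spark) before building the broadcast join should be enough
to try the workaround; the same key can also be passed at submit time with
--conf spark.sql.adaptive.enabled=false.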
>
> On Tue, Aug 3, 2021 at 7:46 AM Adam Binford <[email protected]> wrote:
>
> > I haven't encountered any issues with it but I can investigate with the
> > full stacktrace. Also which version of Spark is this with?
> >
> > Adam
> >
> > On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <[email protected]> wrote:
> >
> >> Hi Pietro,
> >>
> >> Can you please share the full stacktrace of this scala.MatchError? I
> >> tried a couple of test cases but wasn't able to reproduce this error on
> >> my end. In fact, another user complained about the same issue a while
> >> back. I suspect there is a bug in this part.
> >>
> >> I also CCed the contributor of the Sedona broadcast join,
> >> [email protected]. Hi Adam, do you have any idea about this issue?
> >>
> >> Thanks,
> >> Jia
> >>
> >> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <[email protected]>
> >> wrote:
> >>
> >> > Hello Jia,
> >> >
> >> > thank you so much for your support.
> >> >
> >> > We have been able to complete our task and to perform a few runs with
> >> > different numbers of partitions.
> >> > So far we obtained the best performance when running on 20 nodes
> >> > and setting the number of partitions to 2000. With this configuration,
> >> > it took approximately 45 minutes to write the join's output.
> >> >
> >> > Then we tried to perform the same join through broadcast, as you
> >> > suggested, to see whether we could achieve better results, but we
> >> > obtained the following error when calling an action like
> >> > broadcast_join.show() on the output:
> >> >
> >> > Py4JJavaError: An error occurred while calling o699.showString.
> >> > : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE, [id=#3312]
> >> >
> >> > We would be grateful if you can support us on this.
> >> >
> >> >
> >> > The broadcast join was performed as follows:
> >> >
> >> > broadcast_join = points_df.alias('points').join(
> >> >     f.broadcast(polygons_df).alias('polygons'),
> >> >     f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
> >> >
> >> > Attached you can find the pseudo code we used to test broadcast join.
> >> >
> >> >
> >> > Looking forward to hearing from you.
> >> >
> >> >
> >> > Kind regards,
> >> >
> >> > Pietro Greselin
> >> >
> >> >
> >> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <[email protected]> wrote:
> >> >
> >> >> Hi Pietro,
> >> >>
> >> >> A few tips to optimize your join:
> >> >>
> >> >> 1. Mix DF and RDD together and use the RDD API for the join part. See
> >> >> the example here:
> >> >> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
> >> >>
> >> >> 2. When using point_rdd.spatialPartitioning(GridType.KDBTREE, 4), try to
> >> >> use a large number of partitions (say 1000 or more).
> >> >>
> >> >> If this approach doesn't work, consider broadcast join if needed.
> >> >> Broadcast the polygon side:
> >> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
> >> >>
> >> >> Thanks,
> >> >> Jia
> >> >>
> >> >>
> >> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <[email protected]>
> >> >> wrote:
> >> >>
> >> >>> To whom it may concern,
> >> >>>
> >> >>> we observed the following Sedona behaviour and would like to ask your
> >> >>> opinion on how we can optimize it.
> >> >>>
> >> >>> Our aim is to perform an inner spatial join between a points_df and a
> >> >>> polygons_df, matching a point in points_df when it is contained in a
> >> >>> polygon from polygons_df.
> >> >>> Below you can find more details about the 2 dataframes we are
> >> >>> considering:
> >> >>> - points_df: it contains 50 million events with latitude and longitude
> >> >>> approximated to the third decimal digit;
> >> >>> - polygons_df: it contains 10k multi-polygons having 600 vertices on
> >> >>> average.
> >> >>>
> >> >>> The issue we are reporting is a very long computing time: the spatial
> >> >>> join query never completes, even when running on a cluster with 40
> >> >>> workers with 4 cores each.
> >> >>> No error is printed by the driver, but we are receiving the following
> >> >>> warning:
> >> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex is true,
> >> >>> but no index exists. Will build index on the fly.
> >> >>>
> >> >>> We were able to run the same spatial join successfully when
> >> >>> only considering a very small sample of events.
> >> >>> Do you have any suggestions on how we can achieve the same result on
> >> >>> higher volumes of data, or on how we can optimize the join?
> >> >>>
> >> >>> Attached you can find the pseudo-code we are running.
> >> >>>
> >> >>> Looking forward to hearing from you.
> >> >>>
> >> >>> Kind regards,
> >> >>> Pietro Greselin
> >> >>>
> >> >>
> >>
> >
> >
> > --
> > Adam Binford
> >
>
>
> --
> Adam Binford
>
