Hi Pietro,

As you can see from our conversation, for the time being you can work around this by disabling Spark Adaptive Query Execution with "spark.sql.adaptive.enabled=false". I believe this will fix the issue.
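For reference, a minimal PySpark sketch of this workaround (the session builder and app name here are illustrative, not from the thread; this assumes Spark 3.x, where AQE exists):

```python
from pyspark.sql import SparkSession

# Build a session with Adaptive Query Execution disabled, as a workaround
# for the scala.MatchError raised by Sedona's broadcast join under AQE.
spark = (
    SparkSession.builder
    .appName("sedona-aqe-workaround")  # illustrative name
    .config("spark.sql.adaptive.enabled", "false")
    .getOrCreate()
)

# Or toggle it on an already-running session:
spark.conf.set("spark.sql.adaptive.enabled", "false")
```

Either form works; the builder config applies at session creation, while `spark.conf.set` changes it for subsequent queries in an existing session.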
Adam and I will dive deep into this issue and fix the bug.

Thanks,
Jia

On Thu, Aug 5, 2021 at 3:10 PM Adam Binford <adam...@gmail.com> wrote:

> I don't think that's the issue. The join detection is the same for both
> broadcast and non-broadcast joins, so the same match statement needs to run
> either way. I created an issue for what I found from the stack trace (I
> don't have a copy of the stack trace to share easily):
> https://issues.apache.org/jira/browse/SEDONA-56
>
> Adam
>
> On Wed, Aug 4, 2021 at 9:02 PM Jia Yu <jiayu198...@gmail.com> wrote:
>
>> Hi Adam,
>>
>> I believe the issue is caused by this chunk of code:
>> https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala#L84-L109
>>
>> If we move the broadcast join detection to the first part of the detector
>> and put all other join detection in the "else" branch, will that fix the
>> issue?
>>
>> if (broadcast XXX)
>>
>> else {
>>   all other join detection.
>> }
>>
>> Thanks,
>> Jia
>>
>> On Tue, Aug 3, 2021 at 11:19 AM Adam Binford <adam...@gmail.com> wrote:
>>
>>> Okay, I actually did encounter it today. It happens when you have AQE
>>> enabled. I looked into it a little and we might have to rework the
>>> SpatialIndexExec node to extend BroadcastExchangeLike, or maybe even
>>> BroadcastExchangeExec directly, but that might only be compatible with
>>> Spark 3+, so I'm not sure what to do about that. I'm not sure if there
>>> are specific AQE rules or optimizations that can be disabled to make it
>>> work, but if you disable AQE completely it should work for now. I'm also
>>> not at all familiar with the inner workings of AQE, so I don't know the
>>> right way to properly support it.
>>>
>>> Adam
>>>
>>> On Tue, Aug 3, 2021 at 7:46 AM Adam Binford <adam...@gmail.com> wrote:
>>>
>>> > I haven't encountered any issues with it, but I can investigate with
>>> > the full stack trace. Also, which version of Spark is this with?
>>> >
>>> > Adam
>>> >
>>> > On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:
>>> >
>>> >> Hi Pietro,
>>> >>
>>> >> Can you please share the full stack trace of this scala.MatchError?
>>> >> I tried a couple of test cases but wasn't able to reproduce this
>>> >> error on my end. In fact, another user complained about the same
>>> >> issue a while back, so I suspect there is a bug in this part.
>>> >>
>>> >> I also CCed the contributor of the Sedona broadcast join.
>>> >> @adam...@gmail.com <adam...@gmail.com> Hi Adam, do you have any idea
>>> >> about this issue?
>>> >>
>>> >> Thanks,
>>> >> Jia
>>> >>
>>> >> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p.grese...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> > Hello Jia,
>>> >> >
>>> >> > Thank you so much for your support.
>>> >> >
>>> >> > We have been able to complete our task and to perform a few runs
>>> >> > with different numbers of partitions. So far we obtained the best
>>> >> > performance when running on 20 nodes with the number of partitions
>>> >> > set to 2000. With this configuration, it took approximately 45
>>> >> > minutes to write the join's output.
>>> >> >
>>> >> > Then we tried to perform the same join as a broadcast join, as you
>>> >> > suggested, to see whether we could achieve better results, but we
>>> >> > obtained the following error when calling an action such as
>>> >> > broadcast_join.show() on the output:
>>> >> >
>>> >> > Py4JJavaError: An error occurred while calling o699.showString.
>>> >> > : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE, [id=#3312]
>>> >> >
>>> >> > We would be grateful if you could support us on this.
>>> >> >
>>> >> > The broadcast join was performed as follows:
>>> >> >
>>> >> > broadcast_join = points_df.alias('points').join(
>>> >> >     f.broadcast(polygons_df).alias('polygons'),
>>> >> >     f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
>>> >> >
>>> >> > Attached you can find the pseudo code we used to test the broadcast
>>> >> > join.
>>> >> >
>>> >> > Looking forward to hearing from you.
>>> >> >
>>> >> > Kind regards,
>>> >> >
>>> >> > Pietro Greselin
>>> >> >
>>> >> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
>>> >> >
>>> >> >> Hi Pietro,
>>> >> >>
>>> >> >> A few tips to optimize your join:
>>> >> >>
>>> >> >> 1. Mix the DataFrame and RDD APIs together and use the RDD API for
>>> >> >> the join part. See the example here:
>>> >> >> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
>>> >> >>
>>> >> >> 2. When using point_rdd.spatialPartitioning(GridType.KDBTREE, 4),
>>> >> >> try a large number of partitions (say 1000 or more).
>>> >> >>
>>> >> >> If this approach doesn't work, consider a broadcast join and
>>> >> >> broadcast the polygon side:
>>> >> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Jia
>>> >> >>
>>> >> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p.grese...@gmail.com>
>>> >> >> wrote:
>>> >> >>
>>> >> >>> To whom it may concern,
>>> >> >>>
>>> >> >>> we observed the following Sedona behaviour and would like to ask
>>> >> >>> your opinion on how we can optimize it.
>>> >> >>>
>>> >> >>> Our aim is to perform an inner spatial join between a points_df
>>> >> >>> and a polygons_df, matching when a point in points_df is contained
>>> >> >>> in a polygon from polygons_df.
>>> >> >>> Below you can find more details about the two dataframes we are
>>> >> >>> considering:
>>> >> >>> - points_df: it contains 50 million events with latitude and
>>> >> >>> longitude approximated to the third decimal digit;
>>> >> >>> - polygons_df: it contains 10k multi-polygons with 600 vertexes
>>> >> >>> on average.
>>> >> >>>
>>> >> >>> The issue we are reporting is a very long computing time: the
>>> >> >>> spatial join query never completes, even when running on a
>>> >> >>> cluster with 40 workers with 4 cores each.
>>> >> >>> No error is printed by the driver, but we are receiving the
>>> >> >>> following warning:
>>> >> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex
>>> >> >>> is true, but no index exists. Will build index on the fly.
>>> >> >>>
>>> >> >>> We were able to run the same spatial join successfully when
>>> >> >>> considering only a very small sample of events.
>>> >> >>> Do you have any suggestion on how we can achieve the same result
>>> >> >>> on higher volumes of data, or whether there is a way we can
>>> >> >>> optimize the join?
>>> >> >>>
>>> >> >>> Attached you can find the pseudo-code we are running.
>>> >> >>>
>>> >> >>> Looking forward to hearing from you.
>>> >> >>>
>>> >> >>> Kind regards,
>>> >> >>> Pietro Greselin
>>> >> >>>
>>> >> >>
>>> >> >
>>> >
>>> > --
>>> > Adam Binford
>>>
>>> --
>>> Adam Binford
>>
>
> --
> Adam Binford
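For readers unfamiliar with the predicate this thread joins on: ST_Contains(polygon, point) is true when the point lies inside the polygon. A self-contained, deliberately simplified illustration of that test in plain Python (ray casting on a single simple polygon; it ignores holes, multi-polygons, and boundary edge cases, all of which Sedona's real geometry engine handles):

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is `point` inside the simple `polygon`?

    `polygon` is a list of (x, y) vertices in order. This sketch ignores
    holes, multi-polygons, and points exactly on the boundary, which
    ST_Contains handles precisely.
    """
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does edge (x1,y1)-(x2,y2) cross the horizontal ray from the point?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses the ray's y-level
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

# A unit square and two test points
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(point_in_polygon((0.5, 0.5), square))  # True: inside
print(point_in_polygon((1.5, 0.5), square))  # False: outside
```

This per-pair test is cheap, but running it naively for 50 million points against 10k polygons of ~600 vertexes each is what makes the un-partitioned, un-indexed join in this thread so slow; spatial partitioning and on-the-fly indexing exist precisely to prune most of those pair comparisons.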