Hi Pietro,

As you can see from our conversation, for the time being you can work around this by disabling Spark Adaptive Query Execution with "spark.sql.adaptive.enabled=false". I believe this will fix the issue.
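For reference, a minimal PySpark sketch of this workaround (the session builder and app name here are illustrative, not from the thread; this assumes Spark 3.x, where AQE exists):

```python
from pyspark.sql import SparkSession

# Build a session with Adaptive Query Execution disabled, as a workaround
# for the scala.MatchError raised by Sedona's broadcast join under AQE.
spark = (
    SparkSession.builder
    .appName("sedona-aqe-workaround")  # illustrative name
    .config("spark.sql.adaptive.enabled", "false")
    .getOrCreate()
)

# Or toggle it on an already-running session:
spark.conf.set("spark.sql.adaptive.enabled", "false")
```

Either form works; the builder config applies at session creation, while `spark.conf.set` changes it for subsequent queries in an existing session.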
Adam and I will dive deep into this issue and fix the bug.

Thanks,
Jia

On Thu, Aug 5, 2021 at 3:10 PM Adam Binford <adam...@gmail.com> wrote:

> I don't think that's the issue. The join detection is the same for both
> broadcast and non-broadcast joins, so the same match statement needs to run
> either way. I created an issue for what I found from the stack trace (I
> don't have a copy of the stack trace to share easily):
> https://issues.apache.org/jira/browse/SEDONA-56
>
> Adam
>
> On Wed, Aug 4, 2021 at 9:02 PM Jia Yu <jiayu198...@gmail.com> wrote:
>
>> Hi Adam,
>>
>> I believe the issue is caused by this chunk of code:
>> https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala#L84-L109
>>
>> If we move the broadcast join detection to the first part of the detector
>> and put all other join detection in the "else" branch, will that fix the
>> issue?
>>
>> if (broadcast XXX)
>>
>> else {
>>   all other join detection.
>> }
>>
>> Thanks,
>> Jia
>>
>> On Tue, Aug 3, 2021 at 11:19 AM Adam Binford <adam...@gmail.com> wrote:
>>
>>> Okay, I actually did encounter it today. It happens when you have AQE
>>> enabled. I looked into it a little and we might have to rework the
>>> SpatialIndexExec node to extend BroadcastExchangeLike, or maybe even
>>> BroadcastExchangeExec directly, but that might only be compatible with
>>> Spark 3+, so I'm not sure what to do about that. I'm not sure if there
>>> are specific AQE rules or optimizations that can be disabled to make it
>>> work, but if you disable AQE completely it should work for now. I'm also
>>> not at all familiar with the inner workings of AQE, so I don't know the
>>> right way to properly support it.
>>>
>>> Adam
>>>
>>> On Tue, Aug 3, 2021 at 7:46 AM Adam Binford <adam...@gmail.com> wrote:
>>>
>>> > I haven't encountered any issues with it, but I can investigate with
>>> > the full stack trace. Also, which version of Spark is this with?
>>> >
>>> > Adam
>>> >
>>> > On Tue, Aug 3, 2021 at 4:25 AM Jia Yu <ji...@apache.org> wrote:
>>> >
>>> >> Hi Pietro,
>>> >>
>>> >> Can you please share the full stack trace of this scala.MatchError?
>>> >> I tried a couple of test cases but wasn't able to reproduce this
>>> >> error on my end. In fact, another user complained about the same
>>> >> issue a while back, so I suspect there is a bug in this part.
>>> >>
>>> >> I also CCed the contributor of the Sedona broadcast join.
>>> >> @adam...@gmail.com <adam...@gmail.com> Hi Adam, do you have any idea
>>> >> about this issue?
>>> >>
>>> >> Thanks,
>>> >> Jia
>>> >>
>>> >> On Mon, Aug 2, 2021 at 12:43 AM pietro greselin <p.grese...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> > Hello Jia,
>>> >> >
>>> >> > Thank you so much for your support.
>>> >> >
>>> >> > We have been able to complete our task and to perform a few runs
>>> >> > with different numbers of partitions. So far we obtained the best
>>> >> > performance when running on 20 nodes with the number of partitions
>>> >> > set to 2000. With this configuration, it took approximately 45
>>> >> > minutes to write the join's output.
>>> >> >
>>> >> > Then we tried to perform the same join as a broadcast join, as you
>>> >> > suggested, to see whether we could achieve better results, but we
>>> >> > obtained the following error when calling an action such as
>>> >> > broadcast_join.show() on the output:
>>> >> >
>>> >> > Py4JJavaError: An error occurred while calling o699.showString.
>>> >> > : scala.MatchError: SpatialIndex polygonshape#422: geometry, QUADTREE, [id=#3312]
>>> >> >
>>> >> > We would be grateful if you could support us on this.
>>> >> >
>>> >> > The broadcast join was performed as follows:
>>> >> >
>>> >> > broadcast_join = points_df.alias('points').join(
>>> >> >     f.broadcast(polygons_df).alias('polygons'),
>>> >> >     f.expr('ST_Contains(polygons.polygonshape, points.pointshape)'))
>>> >> >
>>> >> > Attached you can find the pseudo code we used to test the broadcast
>>> >> > join.
>>> >> >
>>> >> > Looking forward to hearing from you.
>>> >> >
>>> >> > Kind regards,
>>> >> >
>>> >> > Pietro Greselin
>>> >> >
>>> >> > On Wed, 28 Jul 2021 at 00:46, Jia Yu <ji...@apache.org> wrote:
>>> >> >
>>> >> >> Hi Pietro,
>>> >> >>
>>> >> >> A few tips to optimize your join:
>>> >> >>
>>> >> >> 1. Mix the DataFrame and RDD APIs together and use the RDD API for
>>> >> >> the join part. See the example here:
>>> >> >> https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
>>> >> >>
>>> >> >> 2. When using point_rdd.spatialPartitioning(GridType.KDBTREE, 4),
>>> >> >> try a large number of partitions (say 1000 or more).
>>> >> >>
>>> >> >> If this approach doesn't work, consider a broadcast join and
>>> >> >> broadcast the polygon side:
>>> >> >> https://sedona.apache.org/api/sql/Optimizer/#broadcast-join
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Jia
>>> >> >>
>>> >> >> On Tue, Jul 27, 2021 at 2:21 AM pietro greselin <p.grese...@gmail.com>
>>> >> >> wrote:
>>> >> >>
>>> >> >>> To whom it may concern,
>>> >> >>>
>>> >> >>> we observed the following Sedona behaviour and would like to ask
>>> >> >>> your opinion on how we can optimize it.
>>> >> >>>
>>> >> >>> Our aim is to perform an inner spatial join between a points_df
>>> >> >>> and a polygons_df, matching when a point in points_df is contained
>>> >> >>> in a polygon from polygons_df.
>>> >> >>> Below you can find more details about the two dataframes we are
>>> >> >>> considering:
>>> >> >>> - points_df: it contains 50 million events with latitude and
>>> >> >>> longitude approximated to the third decimal digit;
>>> >> >>> - polygons_df: it contains 10k multi-polygons with 600 vertexes
>>> >> >>> on average.
>>> >> >>>
>>> >> >>> The issue we are reporting is a very long computing time: the
>>> >> >>> spatial join query never completes, even when running on a
>>> >> >>> cluster with 40 workers with 4 cores each.
>>> >> >>> No error is printed by the driver, but we are receiving the
>>> >> >>> following warning:
>>> >> >>> WARN org.apache.sedona.core.spatialOperator.JoinQuery: UseIndex
>>> >> >>> is true, but no index exists. Will build index on the fly.
>>> >> >>>
>>> >> >>> We were able to run the same spatial join successfully when
>>> >> >>> considering only a very small sample of events.
>>> >> >>> Do you have any suggestion on how we can achieve the same result
>>> >> >>> on higher volumes of data, or whether there is a way we can
>>> >> >>> optimize the join?
>>> >> >>>
>>> >> >>> Attached you can find the pseudo-code we are running.
>>> >> >>>
>>> >> >>> Looking forward to hearing from you.
>>> >> >>>
>>> >> >>> Kind regards,
>>> >> >>> Pietro Greselin
>>> >> >>>
>>> >> >>
>>> >> >
>>> >
>>> > --
>>> > Adam Binford
>>>
>>> --
>>> Adam Binford
>>
>
> --
> Adam Binford
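For readers unfamiliar with the predicate this thread joins on: ST_Contains(polygon, point) is true when the point lies inside the polygon. A self-contained, deliberately simplified illustration of that test in plain Python (ray casting on a single simple polygon; it ignores holes, multi-polygons, and boundary edge cases, all of which Sedona's real geometry engine handles):

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is `point` inside the simple `polygon`?

    `polygon` is a list of (x, y) vertices in order. This sketch ignores
    holes, multi-polygons, and points exactly on the boundary, which
    ST_Contains handles precisely.
    """
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does edge (x1,y1)-(x2,y2) cross the horizontal ray from the point?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses the ray's y-level
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

# A unit square and two test points
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(point_in_polygon((0.5, 0.5), square))  # True: inside
print(point_in_polygon((1.5, 0.5), square))  # False: outside
```

This per-pair test is cheap, but running it naively for 50 million points against 10k polygons of ~600 vertexes each is what makes the un-partitioned, un-indexed join in this thread so slow; spatial partitioning and on-the-fly indexing exist precisely to prune most of those pair comparisons.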