Re: Spatial join performances

2021-08-05 Thread Jia Yu
Hi Pietro, As you can see from our conversation, for the time being you can disable Spark Adaptive Query Execution (AQE) by setting "spark.sql.adaptive.enabled=false". I believe this will work around the issue. Adam and I will dig into this issue and fix the bug. Thanks, Jia On Thu, Aug 5, 2021 at 3:10 PM Adam
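
For readers who want to try the workaround above, here is a minimal sketch in Python of one way to apply the setting. The helper names `AQE_WORKAROUND_CONF`, `apply_conf`, and `spark_submit_flags` are hypothetical and not part of Spark or Sedona; the only thing taken from the thread is the key `spark.sql.adaptive.enabled` and the value `false`. The PySpark usage in the trailing comments assumes `pyspark` is installed.

```python
# Hypothetical helpers (not part of Spark or Sedona) that carry the
# workaround setting mentioned in the thread.
AQE_WORKAROUND_CONF = {"spark.sql.adaptive.enabled": "false"}


def apply_conf(builder, conf=AQE_WORKAROUND_CONF):
    """Call builder.config(key, value) for every entry and return the builder.

    `builder` is expected to behave like pyspark's SparkSession.builder,
    whose config(key, value) method returns the builder itself.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def spark_submit_flags(conf=AQE_WORKAROUND_CONF):
    """Render the same settings as `--conf key=value` arguments for spark-submit."""
    flags = []
    for key, value in conf.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


# Usage with PySpark (assumption: pyspark is installed):
#   from pyspark.sql import SparkSession
#   spark = apply_conf(SparkSession.builder.appName("spatial-join")).getOrCreate()
#
# Or on an already running session:
#   spark.conf.set("spark.sql.adaptive.enabled", "false")
```

Setting the option when the session is built keeps the workaround in one place, which should make it easy to remove once the underlying bug is fixed.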

Re: Spatial join performances

2021-08-05 Thread Adam Binford
I don't think that's the issue. The join detection is the same for both broadcast and non-broadcast, so the same match statement needs to run either way. I created an issue for what I found from the stack trace (don't have a copy of the stack trace to share easily):

Re: Spatial join performances

2021-08-04 Thread Jia Yu
Hi Adam, I believe the issue is caused by this chunk of code: https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala#L84-L109 If we move the broadcast join detection as the first part of the detector and set

Re: Spatial join performances

2021-08-03 Thread Adam Binford
Okay I actually did encounter it today. It happens when you have AQE enabled. Looked into it a little bit and might have to rework the SpatialIndexExec node to extend BroadcastExchangeLike or maybe even directly BroadcastExchangeExec, but that might only be compatible with Spark 3+, so not sure

Re: Spatial join performances

2021-08-03 Thread Adam Binford
I haven't encountered any issues with it but I can investigate with the full stacktrace. Also which version of Spark is this with? Adam

Re: Spatial join performances

2021-08-03 Thread Jia Yu
Hi Pietro, Can you please share the full stacktrace of this scala.MatchError? I tried a couple of test cases but wasn't able to reproduce this error on my end. In fact, another user complained about the same issue a while back. I suspect there is a bug in this part. I also CCed the contributor of

Re: Spatial join performances

2021-08-02 Thread pietro greselin
Hello Jia, thank you so much for your support. We have been able to complete our task and to perform a few runs with different numbers of partitions. So far, we have obtained the best performance when running on 20 nodes and setting the number of partitions to 2000. With this configuration,

Re: Spatial join performances

2021-07-27 Thread Jia Yu
Hi Pietro, A few tips to optimize your join: 1. Mix DF and RDD together and use RDD API for the join part. See the example here: https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb 2. When use

Spatial join performances

2021-07-27 Thread pietro greselin
To whom it may concern, we observed the following Sedona behaviour and would like to ask your opinion on how we can optimize it. Our aim is to perform an inner spatial join between a points_df and a polygon_df when a point in points_df is contained in a polygon from polygons_df. Below you can