jiayuasu commented on issue #854: URL: https://github.com/apache/sedona/issues/854#issuecomment-1583448888
There are 3 important params involved in a spatial join:

1. Spatial partitioning dominant side: `sedona.join.spatitionside`. Default: LEFT
2. Spatial index build side: `sedona.join.indexbuildside`. Default: LEFT. See https://github.com/apache/sedona/blob/master/core/src/main/java/org/apache/sedona/core/joinJudgement/DynamicIndexLookupJudgement.java#L91
3. Number of partitions of both RDDs used in a join. It defaults to the number of partitions of the `sedona.join.spatitionside` side, but it will be adjusted to a reasonable number of partitions if the data is much smaller than the number of partitions.

The best practice is that the spatial partitioning dominant side and the spatial index build side should always be the larger dataset (not the smaller one). To find out which one is larger, you can compare the counts of both RDDs. Note that the `SpatialRDD.analyze()` function already computes the count. You can leverage that to automatically determine the dominant side: https://github.com/apache/sedona/blob/master/sql/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/TraitJoinQueryExec.scala#L59

You can add the automation and leave `sedona.join.spatitionside` and `sedona.join.indexbuildside` as optional. In other words, our optimizer will automatically determine the two sides unless the user explicitly specifies the parameters.

I will leave the implementation to you. But if you feel this is too hard, please let me know.
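A minimal sketch of the selection rule described above, in plain Java with no Sedona dependency. The names `JoinSide`, `JoinSideChooser`, `fromConfig`, and `choose` are hypothetical and are not part of Sedona. The two counts are assumed to come from something like the count that `SpatialRDD.analyze()` already computes. Ties fall back to LEFT to match the documented default.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical types for illustration only; none of these exist in Sedona.
enum JoinSide { LEFT, RIGHT }

final class JoinSideChooser {
    private JoinSideChooser() {}

    // Interprets a raw config value (e.g. the value of sedona.join.spatitionside).
    // A missing or empty value means "not set by the user".
    static Optional<JoinSide> fromConfig(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(JoinSide.valueOf(raw.trim().toUpperCase(Locale.ROOT)));
    }

    // Returns the user's explicit choice if present; otherwise the side
    // with the larger record count. Equal counts resolve to LEFT.
    static JoinSide choose(long leftCount, long rightCount, Optional<JoinSide> userSetting) {
        if (userSetting.isPresent()) {
            return userSetting.get();
        }
        return rightCount > leftCount ? JoinSide.RIGHT : JoinSide.LEFT;
    }

    public static void main(String[] args) {
        // No explicit setting: the larger (right) dataset wins.
        System.out.println(choose(5_000L, 1_000_000L, fromConfig(null)));   // RIGHT
        // Explicit setting overrides the automatic choice.
        System.out.println(choose(5_000L, 1_000_000L, fromConfig("left"))); // LEFT
    }
}
```

The same function could be applied once for the partitioning side and once for the index build side, since the recommendation is that both point at the larger dataset unless the user says otherwise.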
