Hi,
I've upgraded from Sedona 1.3.1-incubating to 1.4.0 but am still seeing a
significant slowdown in spatial joins as task processing progresses.
For instance, with a spatial partition count of 4102, the first ~3000 tasks
each complete in under 3 seconds, but later tasks get progressively slower
for the same amount of shuffle-read records.
The final 100 tasks each take over an hour to complete.
I've tried disabling the global index, which seemed to reduce failures, but
I'm still seeing the performance degradation:
// Sedona indexing properties
sparkSession.conf.set("sedona.join.gridtype", "kdbtree")
sparkSession.conf.set("sedona.global.index", "false")
sparkSession.conf.set("sedona.join.indexbuildside", "left")

if (appConf.getNrInputPartitions > 0) {
  sparkSession.conf.set("spark.sql.shuffle.partitions",
    appConf.getNrInputPartitions.toString)
  sparkSession.conf.set("sedona.join.numpartition",
    appConf.getNrInputPartitions.toString)
  sparkSession.conf.set("spark.default.parallelism",
    appConf.getNrInputPartitions.toString)
  LOG.info(s"Set spark.default.parallelism, spark.sql.shuffle.partitions to: " +
    s"${appConf.getNrInputPartitions}")
}
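Since the last few tasks are so much slower than the rest, I suspect the spatial partitions may be heavily skewed. Here is a rough way to quantify that, as a pure-Scala sketch: feed it the per-partition record counts (e.g. gathered via mapPartitionsWithIndex on the partitioned RDD) and check how far the largest partition is from the median. The counts below are illustrative, not real measurements.

```scala
// Quantify partition skew from per-partition record counts.
object SkewCheck {
  // Ratio of the largest partition to the median partition.
  // A value much greater than 1 means a few partitions dominate the runtime.
  def skewRatio(counts: Seq[Long]): Double = {
    require(counts.nonEmpty, "need at least one partition count")
    val sorted = counts.sorted
    val median = sorted(sorted.length / 2).toDouble
    sorted.last / median
  }

  def main(args: Array[String]): Unit = {
    // Illustrative counts: one hot partition among otherwise even ones.
    val counts = Seq(100L, 120L, 95L, 110L, 5000L)
    println(f"skew ratio: ${skewRatio(counts)}%.1f")
  }
}
```

If the ratio comes out large, repartitioning or pre-splitting the dense regions would likely matter more than any of the index settings above.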
I am using a RangeJoin with the DataFrame API:

ST_Intersects(geom, ST_Collect(ST_Point(start_lon, start_lat),
                               ST_Point(end_lon, end_lat)))
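For context, the full join looks roughly like this in SQL form (the table and remaining column names here are illustrative placeholders, not my actual schema):

```sql
-- Shape of the range join: match each trip's start/end points
-- (collected into one multipoint) against the polygon geometries.
SELECT t.*, p.*
FROM trips t
JOIN polygons p
  ON ST_Intersects(p.geom,
       ST_Collect(ST_Point(t.start_lon, t.start_lat),
                  ST_Point(t.end_lon, t.end_lat)))
```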
[two inline screenshots attached]
Is this a bug, or are there steps or settings I could use to get more stable
performance?
Thanks
Trang