Hi,

I've upgraded from Sedona 1.3.1-incubating to 1.4.0 but am still seeing a
significant slowdown in spatial joins as the job progresses through its tasks.

For instance, with a spatial partition count of 4102, the first ~3000 tasks
each complete in under 3 seconds, but the remaining tasks get progressively
slower for the same amount of shuffle read records.
The final 100 tasks take over 1 hour each to complete.

I've tried disabling the global index, which seemed to reduce failures, but I'm
still seeing the performance degradation:

// Sedona indexing properties
sparkSession.conf.set("sedona.join.gridtype", "kdbtree")
sparkSession.conf.set("sedona.global.index", "false")
sparkSession.conf.set("sedona.join.indexbuildside", "left")

if (appConf.getNrInputPartitions > 0) {
  sparkSession.conf.set("spark.sql.shuffle.partitions", appConf.getNrInputPartitions.toString)
  sparkSession.conf.set("sedona.join.numpartition", appConf.getNrInputPartitions.toString)
  sparkSession.conf.set("spark.default.parallelism", appConf.getNrInputPartitions.toString)
  LOG.info(s"Set spark.default.parallelism, spark.sql.shuffle.partitions to: ${appConf.getNrInputPartitions}")
}



I am using a range join via the DataFrame API:
ST_Intersects(geom, ST_Collect(ST_Point(start_lon, start_lat),
ST_Point(end_lon, end_lat)))
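In full, the join looks roughly like the sketch below. The table and column
names (zones, trips, geom, start/end lon/lat) are simplified stand-ins for my
actual job, not the real schema:

```scala
import org.apache.spark.sql.functions.expr

// `zones` carries a polygon column `geom`; `trips` carries the raw coordinates.
val zones = sparkSession.table("zones")
val trips = sparkSession.table("trips")

// Range join: a trip matches a zone if the segment endpoints, collected into
// one geometry, intersect the zone polygon. Sedona plans this as a RangeJoin.
val joined = zones.join(
  trips,
  expr(
    """ST_Intersects(geom,
      |  ST_Collect(ST_Point(start_lon, start_lat),
      |             ST_Point(end_lon, end_lat)))""".stripMargin)
)
```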

[Screenshots of the Spark UI task timings were attached.]

Is this a bug, or are there steps or settings I could use to get more stable
performance?

Thanks
Trang
