Shaunak GuhaThakurata created SEDONA-733:
--------------------------------------------
Summary: Raster functions are extremely slow on Google DataProc Spark
Key: SEDONA-733
URL: https://issues.apache.org/jira/browse/SEDONA-733
Project: Apache Sedona
Issue Type: Bug
Reporter: Shaunak GuhaThakurata
Workloads involving raster data, especially the RS_Clip and RS_ZonalStats
functions, are extremely slow. My workload is relatively simple: I am calculating
the mean elevation of open-source structure footprints using 10 m DEM rasters.
* My rasters are divided into ~912 COG GeoTIFF files. The COGs are 16x16 tiled
* The vector layer of structure footprints is in GeoParquet format,
partitioned by county FIPS code, so the footprints in each partition are
spatially co-located
* Rasters are kept in a GCS bucket
* Output is written to a GCS bucket
* I am running on Google Dataproc 2.2
* Spark 3.5, Sedona 1.7.1
* Cluster config: 3 worker nodes, machine type n1-highmem-64
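For reference, the kind of query involved can be sketched as follows. This is a
minimal illustration, not the exact job: the view names (dem_rasters,
footprints) and column names (rast, geometry, id) are hypothetical, and it
assumes the rasters are already loaded into a Sedona raster column. The helper
only builds the SQL text, so it runs without a Spark session.

```python
def zonal_mean_query(raster_view: str = "dem_rasters",
                     footprint_view: str = "footprints") -> str:
    """Build a Sedona SQL query for the mean elevation of each footprint.

    All view and column names are hypothetical placeholders:
      - raster view: one row per raster, raster column `rast`
      - footprint view: one row per structure, columns `id` and `geometry`
    """
    return f"""
        SELECT f.id AS footprint_id,
               -- band 1 of the DEM, mean of the pixels covered by the footprint
               RS_ZonalStats(r.rast, f.geometry, 1, 'mean') AS mean_elevation
        FROM {footprint_view} AS f
        JOIN {raster_view} AS r
          ON RS_Intersects(r.rast, f.geometry)
    """


if __name__ == "__main__":
    # With a Sedona-enabled session this string would be passed to sedona.sql(...)
    print(zonal_mean_query())
```

A footprint that straddles two rasters would produce two rows with this join, so
a real job would need to decide how to combine them.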
The CPU utilization of the worker nodes is always below 10%. There is some
initial network traffic, but eventually both network traffic and disk I/O in
the cluster drop to nearly 0.
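To put those figures in perspective, here is a rough back-of-the-envelope
calculation. It assumes the 10% figure is an average over all worker vCPUs and
that every vCPU is available to Spark, which is an upper bound in practice. An
n1-highmem-64 node provides 64 vCPUs.

```python
def cluster_summary(workers: int, vcpus_per_worker: int,
                    raster_files: int, cpu_utilization: float) -> dict:
    """Rough capacity figures; assumes every vCPU is usable by Spark (an upper bound)."""
    total_vcpus = workers * vcpus_per_worker
    return {
        "total_vcpus": total_vcpus,
        # vCPUs' worth of work actually being done at the observed utilization
        "busy_vcpu_equivalent": total_vcpus * cpu_utilization,
        # raster files available per vCPU if the files were spread evenly
        "raster_files_per_vcpu": raster_files / total_vcpus,
    }


if __name__ == "__main__":
    # 3 workers, 64 vCPUs each, ~912 COG files, utilization below 10%
    summary = cluster_summary(workers=3, vcpus_per_worker=64,
                              raster_files=912, cpu_utilization=0.10)
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the cluster has 192 vCPUs, of which fewer than about 20
are doing work at any time, even though there are several raster files per vCPU
to process.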
Is this expected behavior? Are there any workarounds to improve performance?
Any advice is greatly appreciated.