[
https://issues.apache.org/jira/browse/SEDONA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947977#comment-17947977
]
Shaunak GuhaThakurata commented on SEDONA-733:
----------------------------------------------
Thanks [~kontinuation] for your prompt reply. We will look into the
partitioning and the associated Spark configuration parameters.
The data is public: USGS 10m DEM and Overture buildings.
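For reference, a minimal sketch of the kind of partitioning and configuration changes we plan to try, using Sedona's raster SQL functions. Bucket paths, column names, the partition count, and the shuffle setting are illustrative assumptions, not tested or recommended values:

```python
# Sketch only: illustrative partitioning/config tuning for the zonal-stats job.
# Paths, column names, and numeric values below are assumptions, not tuned settings.
from sedona.spark import SedonaContext

config = (
    SedonaContext.builder()
    # Assumption: ~2x the total cores on 3 x n1-highmem-64 workers
    .config("spark.sql.shuffle.partitions", "384")
    .getOrCreate()
)
sedona = SedonaContext.create(config)

# Rasters: read the COGs as binary files and decode with RS_FromGeoTiff
(sedona.read.format("binaryFile")
    .load("gs://my-bucket/dem/*.tif")           # hypothetical bucket/path
    .selectExpr("RS_FromGeoTiff(content) AS rast")
    .createOrReplaceTempView("dem"))

# Footprints: repartition so one county's data spreads across all executors
# instead of staying in the few partitions it arrived in
(sedona.read.format("geoparquet")
    .load("gs://my-bucket/footprints/")          # hypothetical bucket/path
    .repartition(384)
    .createOrReplaceTempView("footprints"))

# Mean elevation per footprint; the raster-geometry predicate restricts each
# footprint to the rasters it actually overlaps
result = sedona.sql("""
    SELECT f.id, RS_ZonalStats(d.rast, f.geometry, 1, 'mean') AS mean_elev
    FROM footprints f
    JOIN dem d ON RS_Intersects(d.rast, f.geometry)
""")
```

The idea is simply to spread the footprint partitions across all 192 worker cores before the raster join, since low CPU utilization suggests most executors sit idle.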
> Raster functions are extremely slow on Google DataProc Spark
> ------------------------------------------------------------
>
> Key: SEDONA-733
> URL: https://issues.apache.org/jira/browse/SEDONA-733
> Project: Apache Sedona
> Issue Type: Bug
> Reporter: Shaunak GuhaThakurata
> Priority: Major
>
> Workloads involving raster data, especially the RS_Clip and RS_ZonalStats
> functions, are extremely slow. My workload is relatively simple: I am
> calculating the mean elevation of open-source structure footprints using the
> 10m DEM rasters.
> * My rasters (10m DEM) are divided into ~912 Cloud-Optimized GeoTIFF (COG)
> files. The COGs are 16x16 tiled
> * The vector layer of structure footprints is in GeoParquet format,
> partitioned by county FIPS code, which makes the footprints in each partition
> spatially co-located
> * I am testing with one county's data: ~134,000 structure footprints
> * Rasters are kept in a GCS bucket
> * Output is being written to a GCS bucket
> * I am running on Google Dataproc 2.2
> * Spark 3.5, Sedona 1.7.1
> * Cluster config: worker nodes: 3. Machine type: n1-highmem-64
> The CPU utilization of the worker nodes is always below 10%. There is some
> initial network traffic, but eventually both the network traffic and the disk
> I/O in the cluster drop to nearly zero.
> Is this expected behavior? Are there any workarounds to improve
> performance?
> Any advice is greatly appreciated.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)