[GitHub] [sedona] jornfranke opened a new issue, #926: Getting values for points in a given raster takes very long time

via GitHub Thu, 27 Jul 2023 09:13:48 -0700


jornfranke opened a new issue, #926:
URL: https://github.com/apache/sedona/issues/926


   ## Expected behavior
   
   I have a spatial dataset of points that I load from a Parquet file. 
Each row has an id, a longitude and a latitude. The size of the dataset does 
not really matter: the problem occurs even with a small one containing only a 
few points (e.g. 5).
   
   I join the spatial dataset with a GeoTIFF raster file 
(https://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/FLOODS/EuropeanMaps/floodMap_RP100.zip, 
overview page: 
https://data.jrc.ec.europa.eu/dataset/1d128b6c-a4ee-4858-9e34-6210707f3c81).
   
   Then, I need to get the value of the raster for each point. 
   
   The code is the following
   
   ```
   from pyspark.sql.functions import expr

   # Load the points and build a geometry column from longitude/latitude.
   df = spark.read.parquet("path/to/dataset/with/longitude_latitude")
   df.createOrReplaceTempView("pointDF")
   df = spark.sql(
       "SELECT id, ST_Point(CAST(longitude AS Decimal(24,20)), "
       "CAST(latitude AS Decimal(24,20))) AS geometry FROM pointDF"
   ).withColumnRenamed("geometry", "geometry_points")
   pointDf = df.repartition("id")

   # Load the GeoTIFF as a binary file and parse it into a raster column.
   rasterDf = spark.read.format("binaryFile") \
       .load("path/to/raster/floodmap_EFAS_RP100_C.tif") \
       .withColumn("raster", expr("RS_FromGeoTiff(content)"))

   # Join every point with the single raster (no join condition), then look up the value.
   pointDf = pointDf.join(rasterDf) \
       .withColumn("raster_value", expr("RS_Value(raster, geometry_points)")) \
       .drop("raster", "content")
   pointDf.show(2)
   ```
   
   This should work in reasonable time.
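
For context on why a short runtime seems like a fair expectation: for a north-up raster, reading the value at a point comes down to an index computation and one array access. The following is a minimal pure-Python sketch of that idea, assuming an axis-aligned geotransform with no rotation. It is not Sedona's implementation of `RS_Value`, and the names `world_to_pixel` and `raster_value` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def world_to_pixel(x: float, y: float,
                   origin_x: float, origin_y: float,
                   pixel_width: float, pixel_height: float) -> Tuple[int, int]:
    """Map a world coordinate to (row, col), assuming a north-up raster.

    (origin_x, origin_y) is the upper-left corner of the raster, and
    pixel_width / pixel_height are positive cell sizes in world units.
    """
    col = int((x - origin_x) // pixel_width)
    row = int((origin_y - y) // pixel_height)  # rows grow downwards
    return row, col


def raster_value(grid: Sequence[Sequence[float]], x: float, y: float,
                 origin_x: float, origin_y: float,
                 pixel_width: float, pixel_height: float) -> Optional[float]:
    """Return the cell value under (x, y), or None if the point is outside."""
    row, col = world_to_pixel(x, y, origin_x, origin_y, pixel_width, pixel_height)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


# Toy 2x2 raster whose upper-left corner is at (0, 2), with 1x1 cells.
grid = [[1.0, 2.0],
        [3.0, 4.0]]
inside = raster_value(grid, 1.5, 0.5, 0.0, 2.0, 1.0, 1.0)
outside = raster_value(grid, 5.0, 5.0, 0.0, 2.0, 1.0, 1.0)
```

The lookup cost here does not depend on the size of the raster once it is in memory, which suggests that the time is spent elsewhere (for example in loading or parsing the large GeoTIFF); that is an assumption, not something measured in this report.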
   
   ## Actual behavior
   
   Even for very few points, it takes a very long time (> 10 min) to get the 
values on a very powerful cluster (whose resources are barely used during the 
run).
   
   For other, much smaller rasters (< 2 MB), this is reasonably fast, even for 
millions of points.
   
   ## Steps to reproduce the problem
   
   See description above
   
   ## Settings
   
   Sedona version = 1.4.1
   
   Apache Spark version = 3.2.0
   
   Apache Flink version = not used
   
   API type = Python
   
   Scala version = 2.12
   
   JRE version = 1.8
   
   Python version = 3.9
   
   Environment = Cloudera CDP


