[PR] [SEDONA-406] Raster deserializer for PySpark (#116) [sedona]

via GitHub Wed, 20 Mar 2024 21:20:26 -0700


Kontinuation opened a new pull request, #1281:
URL: https://github.com/apache/sedona/pull/1281


   ## Did you read the Contributor Guide?
   
   - Yes, I have read [Contributor 
Rules](https://sedona.apache.org/latest-snapshot/community/rule/) and 
[Contributor Development 
Guide](https://sedona.apache.org/latest-snapshot/community/develop/)
   
   ## Is this PR related to a JIRA ticket?
   
   - Yes, the URL of the associated JIRA ticket is 
https://issues.apache.org/jira/browse/SEDONA-406. The PR name follows the 
format `[SEDONA-XXX] my subject`.
   
   ## What changes were proposed in this PR?
   
   ### API changes
   
   This PR adds a new class, `SedonaRaster`, to the Sedona Python package. 
Raster objects in Sedona are converted to `SedonaRaster` objects in Python 
when they are collected in PySpark:
   
   ```python
   rows = df_rast.collect()
   rast = rows[0]['rast']  # rast is a SedonaRaster object
   
   # You can get the raster metadata by accessing the properties of
   # SedonaRaster objects
   print(rast.width, rast.height)
   print(rast.affine_trans)
   print(rast.crs_wkt)
   
   # You can get the band data as numpy array
   arr = rast.as_numpy()
   
   # You can also get a rasterio DatasetReader object
   ds = rast.as_rasterio()
   
   # Close the SedonaRaster after use to free the resources allocated
   # for the rasterio DatasetReader object
   rast.close()
   ```
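
Because `close()` has to run even when processing fails, one option is to wrap the raster in `contextlib.closing` (or a `try`/`finally`). The sketch below is illustrative only: `FakeRaster` is a hypothetical stand-in, not part of Sedona, so the example runs without Spark. With a real `SedonaRaster`, it assumes only the `as_numpy()` and `close()` methods shown above.

```python
from contextlib import closing


class FakeRaster:
    """Hypothetical stand-in for SedonaRaster; mimics only as_numpy() and close()."""

    def __init__(self, pixels):
        self._pixels = pixels
        self.closed = False

    def as_numpy(self):
        # The real method returns a numpy array; a nested list keeps
        # this sketch free of third-party dependencies.
        if self.closed:
            raise ValueError("raster is closed")
        return self._pixels

    def close(self):
        self.closed = True


def sum_pixels(rast):
    # closing() calls rast.close() on exit, even if the body raises.
    with closing(rast):
        return sum(sum(row) for row in rast.as_numpy())


rast = FakeRaster([[1, 2], [3, 4]])
total = sum_pixels(rast)
print(total, rast.closed)
```

The same pattern should apply to any object that exposes a `close()` method.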
   
   Users can define Pandas UDFs that take raster objects as parameters. Use 
the `deserialize` function in the `sedona.raster.raster_serde` module to 
deserialize the bytes into a `SedonaRaster` object before processing it. 
Note that this only works with Spark >= 3.4.0.
   
   ```python
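   import numpy as np
   import pandas as pd
   from pyspark.sql.functions import pandas_udf
   from pyspark.sql.types import IntegerType

   # A Python Pandas UDF that takes a raster as input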
   @pandas_udf(IntegerType())
   def pandas_udf_raster_as_param(s: pd.Series) -> pd.Series:
       from sedona.raster import raster_serde
   
       def func(x):
           with raster_serde.deserialize(x) as raster:
               arr = raster.as_numpy()
               return int(np.sum(arr))
   
       return s.apply(func)
   
   spark.udf.register("pandas_udf_raster_as_param", pandas_udf_raster_as_param)
   ```
   
   ### Internal changes
   
   * Changed the serialization format of RasterUDT to a language-neutral format
     * Notably, CRS is now serialized to WKT instead of using the Java 
serializer.
     * This also significantly improves the performance of raster 
serialization/deserialization, since the new Kryo serializer is much faster 
than the Java serializer used before.
   * Added a raster deserializer for PySpark
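
To illustrate what a "language-neutral" format buys, here is a minimal sketch of encoding raster metadata with fixed-width fields and a length-prefixed WKT string. The layout, the `HEADER` struct, and the function names are assumptions made up for this example; this is not Sedona's actual wire format. The point is only that such bytes can be read from Python (or any other language) without a Java deserializer.

```python
import struct

# Hypothetical layout for illustration only (NOT Sedona's actual format):
# width:int32, height:int32, six affine coefficients as float64,
# then the byte length of a UTF-8 WKT string as uint32, all big-endian.
HEADER = struct.Struct(">ii6dI")


def encode_metadata(width, height, affine, crs_wkt):
    # Encode the CRS as plain WKT text instead of a language-specific object.
    wkt_bytes = crs_wkt.encode("utf-8")
    return HEADER.pack(width, height, *affine, len(wkt_bytes)) + wkt_bytes


def decode_metadata(buf):
    # Unpack the fixed-size header, then read the WKT text that follows it.
    *fields, wkt_len = HEADER.unpack_from(buf, 0)
    width, height = fields[0], fields[1]
    affine = tuple(fields[2:8])
    start = HEADER.size
    crs_wkt = buf[start:start + wkt_len].decode("utf-8")
    return width, height, affine, crs_wkt


affine = (30.0, 0.0, 500000.0, 0.0, -30.0, 4100000.0)
payload = encode_metadata(512, 256, affine, 'GEOGCS["WGS 84"]')
decoded = decode_metadata(payload)
print(decoded)
```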
   
   ## How was this patch tested?
   
   Added new tests
   
   ## Did this PR include necessary documentation updates?
   
   - **TODO**: We need to document the usage of `SedonaRaster` and 
`raster_serde`.


