Kontinuation opened a new pull request, #1281: URL: https://github.com/apache/sedona/pull/1281
## Did you read the Contributor Guide?

- Yes, I have read [Contributor Rules](https://sedona.apache.org/latest-snapshot/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/latest-snapshot/community/develop/)

## Is this PR related to a JIRA ticket?

- Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-406. The PR name follows the format `[SEDONA-XXX] my subject`.

## What changes were proposed in this PR?

### API changes

This PR adds a new class `SedonaRaster` to the Sedona Python package. Raster objects in Sedona are converted to `SedonaRaster` objects in Python when they are collected in PySpark:

```python
rows = df_rast.collect()
rast = rows[0]['rast']  # rast is a SedonaRaster object

# You can get the metadata of the raster by accessing the properties of SedonaRaster objects
print(rast.width, rast.height)
print(rast.affine_trans)
print(rast.crs_wkt)

# You can get the band data as a numpy array
arr = rast.as_numpy()

# You can also get a rasterio DatasetReader object
ds = rast.as_rasterio()

# Please close the SedonaRaster after using it to free up resources allocated for the rasterio DatasetReader object
rast.close()
```

Users can define Pandas UDFs that take raster objects as parameters. Use the `deserialize` function in the `sedona.raster.raster_serde` module to deserialize the bytes into a `SedonaRaster` object before processing it. Note that this only works with Spark >= 3.4.0.
```python
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

# A Python Pandas UDF that takes a raster as input
@pandas_udf(IntegerType())
def pandas_udf_raster_as_param(s: pd.Series) -> pd.Series:
    from sedona.raster import raster_serde

    def func(x):
        # Deserialize the raster bytes into a SedonaRaster for the duration of the with block
        with raster_serde.deserialize(x) as raster:
            arr = raster.as_numpy()
            return int(np.sum(arr))

    return s.apply(func)

spark.udf.register("pandas_udf_raster_as_param", pandas_udf_raster_as_param)
```

### Internal changes

* Changed the serialization format of RasterUDT to a language-neutral format
  * Notably, CRS is now serialized to WKT instead of using the Java serializer.
  * This also significantly improved the performance of raster serialization/deserialization, since the new Kryo serializer is much faster than the Java serializer we used before.
* Added a raster deserializer for PySpark

## How was this patch tested?

Added new tests

## Did this PR include necessary documentation updates?

- **TODO**: We need to document the usage of `SedonaRaster` and `raster_serde`.
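
As a rough illustration of why the language-neutral format described under "Internal changes" helps PySpark, here is a minimal sketch of the general idea. The byte layout, the `encode_raster_meta` and `decode_raster_meta` helpers, and the sample values are all hypothetical and are not the actual RasterUDT format used by this PR. The point is only that fixed-width numbers plus a length-prefixed WKT string can be read from any language, whereas a Java-serialized CRS object cannot.

```python
import struct

# HYPOTHETICAL layout for illustration only; this is NOT Sedona's actual
# RasterUDT wire format.
# Header: width (uint32), height (uint32), six affine coefficients (float64),
# all little-endian, followed by a uint32 length and the CRS as UTF-8 WKT.
_HEADER = struct.Struct("<II6d")
_LEN = struct.Struct("<I")


def encode_raster_meta(width, height, affine, crs_wkt):
    """Pack raster metadata into plain bytes that any language can parse."""
    wkt_bytes = crs_wkt.encode("utf-8")
    return _HEADER.pack(width, height, *affine) + _LEN.pack(len(wkt_bytes)) + wkt_bytes


def decode_raster_meta(buf):
    """Inverse of encode_raster_meta: read the header, then the WKT string."""
    width, height, *affine = _HEADER.unpack_from(buf, 0)
    (wkt_len,) = _LEN.unpack_from(buf, _HEADER.size)
    start = _HEADER.size + _LEN.size
    crs_wkt = buf[start:start + wkt_len].decode("utf-8")
    return width, height, tuple(affine), crs_wkt


# Made-up sample metadata; the WKT string is abbreviated for brevity.
meta = (512, 256, (30.0, 0.0, 100.0, 0.0, -30.0, 200.0), 'GEOGCS["WGS 84"]')
blob = encode_raster_meta(*meta)
assert decode_raster_meta(blob) == meta
```

Carrying the CRS as WKT text keeps the Python side free of any dependency on Java class layouts, which is presumably what makes a pure-Python deserializer such as `raster_serde.deserialize` practical.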