Re: [PR] [GH-2376] Preserve CRS SRID from GeoParquet metadata when reading geometries [sedona]

via GitHub Wed, 18 Feb 2026 21:58:06 -0800


Copilot commented on code in PR #2661:
URL: https://github.com/apache/sedona/pull/2661#discussion_r2825826264



##########
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetSchemaConverter.scala:
##########
@@ -78,6 +78,21 @@ class GeoParquetToSparkSchemaConverter(
       
conf.get(PortableSQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.key).toBoolean,
     parameters = parameters)
 
+  /**
+   * Returns the SRID for a geometry column based on the CRS metadata in the 
GeoParquet file.
+   * @param columnName
+   *   name of the geometry column
+   * @return
+   *   the SRID extracted from the CRS metadata. Returns 4326 if the CRS is 
omitted, 0 if the CRS
+   *   is null or unrecognized, or the EPSG code if the PROJJSON contains an 
EPSG identifier.
+   */
+  def getSrid(columnName: String): Int = {
+    geoParquetMetaData.columns.get(columnName) match {
+      case Some(fieldMetadata) => 
GeoParquetMetaData.extractSridFromCrs(fieldMetadata.crs)
+      case None => 0
+    }
+  }

Review Comment:
   getSrid() can be called frequently during row conversion, and it currently 
triggers CRS parsing (Proj/PROJJSON) on every call via extractSridFromCrs(). 
Consider caching the computed SRID per column (e.g., build a Map[String, Int] 
once from geoParquetMetaData.columns) so row decoding only does a map lookup. 
Also, the Scaladoc describes behavior for omitted/null/unrecognized CRS but 
doesn’t mention that an unknown columnName returns 0; please document that case 
(or change the default if appropriate).



##########
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetRowConverter.scala:
##########
@@ -233,6 +235,8 @@ private[geoparquet] class GeoParquetRowConverter(
                 val wkbReader = new WKBReader()
                 val byteArray = currentArray.map(_.asInstanceOf[Byte]).toArray
                 val geom = wkbReader.read(byteArray)
+                val srid = schemaConverter.getSrid(parquetType.getName)
+                geom.setSRID(srid)
                 this.updater.set(GeometryUDT.serialize(geom))

Review Comment:
   Same as the primitive geometry path: SRID is computed inside end() for every 
array value, which can repeatedly parse CRS metadata. Compute the SRID once 
when constructing this converter and reuse it to avoid per-row CRS parsing 
overhead.



##########
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetMetaData.scala:
##########
@@ -137,6 +138,57 @@ object GeoParquetMetaData {
     compactJson(serializedGeoObject)
   }
 
+  /**
+   * Default SRID for GeoParquet files where the CRS field is omitted. Per the 
GeoParquet spec,
+   * omitting the CRS implies OGC:CRS84, which is equivalent to EPSG:4326.
+   */
+  val DEFAULT_SRID: Int = 4326
+
+  /**
+   * Extract SRID from a GeoParquet CRS metadata value.
+   *
+   * Per the GeoParquet specification:
+   *   - If the CRS field is absent (None), the CRS is OGC:CRS84 (EPSG:4326).
+   *   - If the CRS field is explicitly null, the CRS is unknown (SRID 0).
+   *   - If the CRS field is a PROJJSON object with an "id" containing 
"authority" and "code", the
+   *     EPSG code is used as the SRID.
+   *
+   * @param crs
+   *   The CRS field from GeoParquet column metadata.
+   * @return
+   *   The SRID corresponding to the CRS. Returns 4326 for omitted CRS, 0 for 
null or unrecognized
+   *   CRS, and the EPSG code for PROJJSON with an EPSG identifier.
+   */
+  def extractSridFromCrs(crs: Option[JValue]): Int = {
+    crs match {
+      case None =>
+        // CRS omitted: default to OGC:CRS84 (EPSG:4326) per GeoParquet spec
+        DEFAULT_SRID
+      case Some(JNull) =>
+        // CRS explicitly null: unknown CRS
+        0
+      case Some(projjson) =>
+        // Use proj4sedona to extract authority and code from PROJJSON
+        try {
+          val jsonStr = compactJson(projjson)
+          val result = new Proj(jsonStr).toAuthority()
+          if (result != null && result.length == 2) {
+            result(0) match {
+              case "EPSG" =>
+                try { result(1).toInt }
+                catch { case _: NumberFormatException => 0 }
+              case "OGC" if result(1) == "CRS84" => 4326
+              case _ => 0
+            }
+          } else {
+            0
+          }
+        } catch {
+          case _: Exception => 0
+        }

Review Comment:
   extractSridFromCrs catches all Exception and silently returns 0. This can 
accidentally swallow InterruptedException and other non-data-related failures. 
Prefer catching NonFatal and, if an InterruptedException occurs, restore the 
thread interrupt status before returning. Also consider using DEFAULT_SRID 
instead of the hard-coded 4326 in the OGC:CRS84 branch for consistency.



##########
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetRowConverter.scala:
##########
@@ -220,6 +220,8 @@ private[geoparquet] class GeoParquetRowConverter(
             override def addBinary(value: Binary): Unit = {
               val wkbReader = new WKBReader()
               val geom = wkbReader.read(value.getBytes)
+              val srid = schemaConverter.getSrid(parquetType.getName)
+              geom.setSRID(srid)
               this.updater.set(GeometryUDT.serialize(geom))

Review Comment:
   SRID is looked up and CRS is parsed inside addBinary for every geometry 
value. Since getSrid() ultimately calls extractSridFromCrs() which instantiates 
Proj and parses PROJJSON, this adds per-row overhead. Consider computing the 
SRID once when creating the converter for this field (e.g., val srid = 
schemaConverter.getSrid(parquetType.getName) outside addBinary) and reusing it 
for all rows of that column (or caching SRIDs in the schemaConverter).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [GH-2376] Preserve CRS SRID from GeoParquet metadata when reading geometries [sedona]

Reply via email to