szehon-ho commented on code in PR #55207: URL: https://github.com/apache/spark/pull/55207#discussion_r3047796739
########## docs/sql-ref-geospatial-types.md: ########## @@ -142,6 +142,92 @@ SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F00000000000 * **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error (or you can use `ST_SetSrid` to set the value’s SRID to match the column). * **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed. * **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required. +### Supported SRIDs + +Spark includes a pre-built registry of standard Spatial Reference Identifiers (SRIDs) from the PROJ database, with overrides to support OGC standards. This registry enables validation and proper handling of coordinate systems for geospatial data. + +#### Commonly Used SRIDs + +| SRID | Name | Description | Typical Use Case | +|------|------|-------------|------------------| +| 4326 | WGS 84 | World Geodetic System 1984 (latitude/longitude) | GPS coordinates, global data (default for GEOGRAPHY) | +| 3857 | Web Mercator | Pseudo-Mercator projection used by web mapping services | Web maps (Google Maps, OpenStreetMap, Bing Maps) | +| 2154 | RGF93 / Lambert-93 | French national coordinate system | France-specific mapping and GIS | +| 32633 | WGS 84 / UTM zone 33N | Universal Transverse Mercator, zone 33 North | Central Europe (6°E to 12°E) | +| 32634 | WGS 84 / UTM zone 34N | Universal Transverse Mercator, zone 34 North | Eastern Europe (12°E to 18°E) | +| 32635 | WGS 84 / UTM zone 35N | Universal Transverse Mercator, zone 35 North | Eastern Europe/Western Asia (18°E to 24°E) | Review Comment: Consider adding a **CRS Identifier** column. 
Spark maps SRIDs to CRS strings internally, and these strings are visible to users in `df.schema.json()` output and in Parquet/Delta/Iceberg storage metadata. For example, `GEOMETRY(4326)` stores as `geometry(OGC:CRS84)` in JSON schema, not `EPSG:4326`. This is a common source of confusion. The key mappings are:

| SRID | CRS Identifier |
|------|---------------|
| 0 | `SRID:0` |
| 3857 | `EPSG:3857` |
| 4326 | `OGC:CRS84` |
| 4267 | `OGC:CRS27` |
| 4269 | `OGC:CRS83` |

Also worth noting which SRIDs are valid for GEOGRAPHY vs GEOMETRY. For instance, `GEOMETRY(3857)` works but `GEOGRAPHY(3857)` will error because 3857 is a projected (non-geographic) CRS. That's a real pitfall for users.

########## docs/sql-ref-geospatial-types.md: ########## @@ -142,6 +142,92 @@ SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F00000000000 * **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error (or you can use `ST_SetSrid` to set the value’s SRID to match the column). * **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed. * **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required. +### Supported SRIDs + +Spark includes a pre-built registry of standard Spatial Reference Identifiers (SRIDs) from the PROJ database, with overrides to support OGC standards. This registry enables validation and proper handling of coordinate systems for geospatial data.
+ +#### Commonly Used SRIDs + +| SRID | Name | Description | Typical Use Case | +|------|------|-------------|------------------| +| 4326 | WGS 84 | World Geodetic System 1984 (latitude/longitude) | GPS coordinates, global data (default for GEOGRAPHY) | +| 3857 | Web Mercator | Pseudo-Mercator projection used by web mapping services | Web maps (Google Maps, OpenStreetMap, Bing Maps) | +| 2154 | RGF93 / Lambert-93 | French national coordinate system | France-specific mapping and GIS | +| 32633 | WGS 84 / UTM zone 33N | Universal Transverse Mercator, zone 33 North | Central Europe (6°E to 12°E) | +| 32634 | WGS 84 / UTM zone 34N | Universal Transverse Mercator, zone 34 North | Eastern Europe (12°E to 18°E) | +| 32635 | WGS 84 / UTM zone 35N | Universal Transverse Mercator, zone 35 North | Eastern Europe/Western Asia (18°E to 24°E) | + +The registry includes many additional SRIDs for various UTM zones, national coordinate systems, and other projections. For a complete list, refer to the [EPSG Geodetic Parameter Dataset](https://epsg.org/). + +#### Using Different SRIDs + +**Creating tables with specific SRIDs:**
Review Comment:
Most of the examples in sections "Using Different SRIDs", "Converting between SRIDs", and "SRID Validation" repeat what the page already covers in "Creating Tables" (lines 62–79) and "Built-in Geospatial Functions" (lines 129–137). Consider replacing them with examples that show genuinely new behavior:

- **SRID validation error**: The 99999 case is useful; keep it.
- **GEOGRAPHY vs GEOMETRY pitfall**: Show that `GEOGRAPHY(3857)` errors because 3857 is non-geographic; this is a real user trap not documented elsewhere.
- **OGC CRS strings in metadata**: Show that `df.schema.json()` for `GEOMETRY(4326)` contains `OGC:CRS84`, so users know what to expect in Parquet/storage metadata.
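To make the third bullet concrete, here is a minimal Python sketch of the lookup a reader could keep in mind when reading schema JSON. It is a toy table built only from the SRID-to-CRS pairs quoted in this review, not Spark's registry or API; `CRS_BY_SRID` and `type_string` are hypothetical names. Only the `geometry(OGC:CRS84)` form for `GEOMETRY(4326)` is stated in this review; the same string shape for the other entries is an assumption.

```python
# Toy lookup built from the SRID -> CRS identifier pairs quoted in this review.
# This is NOT Spark's registry; the names below are hypothetical.
CRS_BY_SRID = {
    0: "SRID:0",
    3857: "EPSG:3857",
    4326: "OGC:CRS84",
    4267: "OGC:CRS27",
    4269: "OGC:CRS83",
}


def type_string(kind: str, srid: int) -> str:
    """Build a type string shaped like the schema JSON form, e.g. geometry(OGC:CRS84)."""
    if srid not in CRS_BY_SRID:
        # The toy table only knows the five SRIDs listed above.
        raise KeyError(f"SRID {srid} is not in this toy table")
    return f"{kind.lower()}({CRS_BY_SRID[srid]})"


# GEOMETRY(4326) shows up with the OGC identifier, not EPSG:4326.
print(type_string("GEOMETRY", 4326))
```

An example along these lines in the doc, using the real `df.schema.json()` output, would show users which identifier to expect in storage metadata.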
########## docs/sql-ref-geospatial-types.md: ########## @@ -142,6 +142,92 @@ SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F00000000000 * **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error (or you can use `ST_SetSrid` to set the value’s SRID to match the column). * **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed. * **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required. +### Supported SRIDs + +Spark includes a pre-built registry of standard Spatial Reference Identifiers (SRIDs) from the PROJ database, with overrides to support OGC standards. This registry enables validation and proper handling of coordinate systems for geospatial data. + +#### Commonly Used SRIDs + +| SRID | Name | Description | Typical Use Case | +|------|------|-------------|------------------| +| 4326 | WGS 84 | World Geodetic System 1984 (latitude/longitude) | GPS coordinates, global data (default for GEOGRAPHY) | +| 3857 | Web Mercator | Pseudo-Mercator projection used by web mapping services | Web maps (Google Maps, OpenStreetMap, Bing Maps) | +| 2154 | RGF93 / Lambert-93 | French national coordinate system | France-specific mapping and GIS | +| 32633 | WGS 84 / UTM zone 33N | Universal Transverse Mercator, zone 33 North | Central Europe (6°E to 12°E) | +| 32634 | WGS 84 / UTM zone 34N | Universal Transverse Mercator, zone 34 North | Eastern Europe (12°E to 18°E) | +| 32635 | WGS 84 / UTM zone 35N | Universal Transverse Mercator, zone 35 North | Eastern Europe/Western Asia (18°E to 24°E) | + +The registry includes many additional SRIDs for various UTM zones, national coordinate systems, and other projections. 
For a complete list, refer to the [EPSG Geodetic Parameter Dataset](https://epsg.org/). + +#### Using Different SRIDs + +**Creating tables with specific SRIDs:** + +```sql +-- Web Mercator projection (common for web mapping applications) +CREATE TABLE web_map_data ( + id BIGINT, + location GEOMETRY(3857) +); + +-- UTM zone 33N for Central Europe +CREATE TABLE europe_survey_data ( + id BIGINT, + measurement_point GEOMETRY(32633) +); + +-- French national grid +CREATE TABLE france_cadastre ( + id BIGINT, + parcel GEOMETRY(2154) +); +``` + +**Converting between SRIDs:** Review Comment: The heading "Converting between SRIDs" implies coordinate reprojection, but `ST_SetSrid` only changes metadata. Suggest renaming to something like **"Setting or Changing SRID Metadata"**. Also, the example changes a point from SRID 4326 (lat/lon in degrees) to 3857 (Web Mercator in meters) — this produces a semantically incorrect result since the coordinates are still degree values but now labeled as meters. A better example would set SRID on data that was created without one, e.g. SRID 0 → 4326, which is the common real-world use case. The existing doc already shows an `ST_SetSrid` example (line 136) that does this correctly. ########## docs/sql-ref-geospatial-types.md: ########## @@ -142,6 +142,92 @@ SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F00000000000 * **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error (or you can use `ST_SetSrid` to set the value’s SRID to match the column). * **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed. * **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required. 
+### Supported SRIDs + +Spark includes a pre-built registry of standard Spatial Reference Identifiers (SRIDs) from the PROJ database, with overrides to support OGC standards. This registry enables validation and proper handling of coordinate systems for geospatial data. + +#### Commonly Used SRIDs + +| SRID | Name | Description | Typical Use Case | +|------|------|-------------|------------------| +| 4326 | WGS 84 | World Geodetic System 1984 (latitude/longitude) | GPS coordinates, global data (default for GEOGRAPHY) | +| 3857 | Web Mercator | Pseudo-Mercator projection used by web mapping services | Web maps (Google Maps, OpenStreetMap, Bing Maps) | +| 2154 | RGF93 / Lambert-93 | French national coordinate system | France-specific mapping and GIS | +| 32633 | WGS 84 / UTM zone 33N | Universal Transverse Mercator, zone 33 North | Central Europe (6°E to 12°E) | +| 32634 | WGS 84 / UTM zone 34N | Universal Transverse Mercator, zone 34 North | Eastern Europe (12°E to 18°E) | +| 32635 | WGS 84 / UTM zone 35N | Universal Transverse Mercator, zone 35 North | Eastern Europe/Western Asia (18°E to 24°E) | + +The registry includes many additional SRIDs for various UTM zones, national coordinate systems, and other projections. For a complete list, refer to the [EPSG Geodetic Parameter Dataset](https://epsg.org/). 
+ +#### Using Different SRIDs + +**Creating tables with specific SRIDs:** + +```sql +-- Web Mercator projection (common for web mapping applications) +CREATE TABLE web_map_data ( + id BIGINT, + location GEOMETRY(3857) +); + +-- UTM zone 33N for Central Europe +CREATE TABLE europe_survey_data ( + id BIGINT, + measurement_point GEOMETRY(32633) +); + +-- French national grid +CREATE TABLE france_cadastre ( + id BIGINT, + parcel GEOMETRY(2154) +); +``` + +**Converting between SRIDs:** + +```sql +-- Create a point in WGS 84 (SRID 4326) +SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326) AS point_wgs84; + +-- Change SRID to Web Mercator (note: this only changes the SRID metadata, not the coordinates) +SELECT ST_SetSrid( + ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326), + 3857 +) AS point_web_mercator; +``` + +**Important:** `ST_SetSrid` only changes the SRID metadata; it does not transform coordinates. For actual coordinate transformation between different coordinate systems, use appropriate transformation functions or external tools. + +#### SRID Validation + +When creating GEOMETRY or GEOGRAPHY values, Spark validates that the specified SRID exists in the pre-built registry. Using an unsupported or invalid SRID will result in an error. + +```sql +-- Valid: 4326 is in the registry +SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326); Review Comment: The 4326 and 3857 examples here repeat what's already shown in the "Built-in Geospatial Functions" section above. Consider trimming to just the 99999 error case — that's the genuinely new and useful example. You could also add a `GEOGRAPHY(3857)` failure example here, since that's a real pitfall not documented elsewhere. 
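As a rough illustration of the two failure modes worth documenting (an SRID missing from the registry, and a non-geographic SRID used with GEOGRAPHY), here is a toy Python model. It assumes nothing about Spark's internals: the three-entry table, `check_srid`, and the error messages are all hypothetical, and the geographic flags come only from what this review states about SRIDs 0, 4326, and 3857.

```python
# Toy model of the two failure modes; not Spark's implementation.
# True means the SRID refers to a geographic (longitude/latitude) CRS.
IS_GEOGRAPHIC = {
    0: False,     # unspecified CRS, described in this review as non-geographic
    4326: True,   # WGS 84
    3857: False,  # Web Mercator, a projected CRS
}


def check_srid(kind: str, srid: int) -> None:
    """Raise ValueError if the SRID is unknown, or non-geographic for GEOGRAPHY."""
    if srid not in IS_GEOGRAPHIC:
        raise ValueError(f"invalid SRID {srid}")
    if kind.upper() == "GEOGRAPHY" and not IS_GEOGRAPHIC[srid]:
        raise ValueError(f"SRID {srid} is not geographic, so GEOGRAPHY({srid}) is rejected")


cases = [("GEOMETRY", 3857), ("GEOGRAPHY", 4326), ("GEOGRAPHY", 3857), ("GEOMETRY", 99999)]
for kind, srid in cases:
    try:
        check_srid(kind, srid)
        print(f"{kind}({srid}): ok")
    except ValueError as err:
        print(f"{kind}({srid}): error: {err}")
```

The doc examples would of course use SQL and the real error output; the point is only that the 99999 case and the `GEOGRAPHY(3857)` case fail for different reasons.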
########## docs/sql-ref-geospatial-types.md: ########## @@ -142,6 +142,92 @@ SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F00000000000 * **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error (or you can use `ST_SetSrid` to set the value’s SRID to match the column). * **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed. * **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required. +### Supported SRIDs + +Spark includes a pre-built registry of standard Spatial Reference Identifiers (SRIDs) from the PROJ database, with overrides to support OGC standards. This registry enables validation and proper handling of coordinate systems for geospatial data. + +#### Commonly Used SRIDs + +| SRID | Name | Description | Typical Use Case | +|------|------|-------------|------------------| +| 4326 | WGS 84 | World Geodetic System 1984 (latitude/longitude) | GPS coordinates, global data (default for GEOGRAPHY) | +| 3857 | Web Mercator | Pseudo-Mercator projection used by web mapping services | Web maps (Google Maps, OpenStreetMap, Bing Maps) | +| 2154 | RGF93 / Lambert-93 | French national coordinate system | France-specific mapping and GIS | +| 32633 | WGS 84 / UTM zone 33N | Universal Transverse Mercator, zone 33 North | Central Europe (6°E to 12°E) | +| 32634 | WGS 84 / UTM zone 34N | Universal Transverse Mercator, zone 34 North | Eastern Europe (12°E to 18°E) | +| 32635 | WGS 84 / UTM zone 35N | Universal Transverse Mercator, zone 35 North | Eastern Europe/Western Asia (18°E to 24°E) | + +The registry includes many additional SRIDs for various UTM zones, national coordinate systems, and other projections. 
For a complete list, refer to the [EPSG Geodetic Parameter Dataset](https://epsg.org/). + +#### Using Different SRIDs + +**Creating tables with specific SRIDs:** + +```sql +-- Web Mercator projection (common for web mapping applications) +CREATE TABLE web_map_data ( + id BIGINT, + location GEOMETRY(3857) +); + +-- UTM zone 33N for Central Europe +CREATE TABLE europe_survey_data ( + id BIGINT, + measurement_point GEOMETRY(32633) +); + +-- French national grid +CREATE TABLE france_cadastre ( + id BIGINT, + parcel GEOMETRY(2154) +); +``` + +**Converting between SRIDs:** + +```sql +-- Create a point in WGS 84 (SRID 4326) +SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326) AS point_wgs84; + +-- Change SRID to Web Mercator (note: this only changes the SRID metadata, not the coordinates) +SELECT ST_SetSrid( + ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326), + 3857 +) AS point_web_mercator; +``` + +**Important:** `ST_SetSrid` only changes the SRID metadata; it does not transform coordinates. For actual coordinate transformation between different coordinate systems, use appropriate transformation functions or external tools. + +#### SRID Validation + +When creating GEOMETRY or GEOGRAPHY values, Spark validates that the specified SRID exists in the pre-built registry. Using an unsupported or invalid SRID will result in an error. + +```sql +-- Valid: 4326 is in the registry +SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326); +-- Returns: GEOMETRY with SRID 4326 + +-- Valid: 3857 (Web Mercator) is in the registry +SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 3857); +-- Returns: GEOMETRY with SRID 3857 + +-- Error: 99999 is not a valid SRID in the registry +SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 99999); +-- Throws error: Invalid SRID +``` + +#### SRID 0 (Unspecified) + +SRID 0 represents an unspecified or unknown coordinate system. 
It is allowed for GEOMETRY types but should be used with caution:
Review Comment:
A few issues here:

1. **"should be used with caution"** is overstated: SRID 0 is the default for `ST_GeomFromWKB(wkb)` and is actively used in `CREATE TABLE` (e.g., `CREATE TABLE t (geom GEOMETRY(0)) USING PARQUET` in the test suite). It's a standard convention (PostGIS uses the same).
2. **Missing GEOGRAPHY restriction**: SRID 0 is **not** valid for GEOGRAPHY types (it's registered as non-geographic, so `GeographicSpatialReferenceSystemMapper` rejects it). This is important to document.
3. **Could be confused with `GEOMETRY(ANY)`**: Worth clarifying that `GEOMETRY(0)` means a fixed SRID of 0 (Cartesian, no defined CRS), not "per-row SRID." Per-row SRIDs use `GEOMETRY(ANY)`.

########## docs/sql-ref-geospatial-types.md: ########## @@ -142,6 +142,92 @@ SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F00000000000 * **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error (or you can use `ST_SetSrid` to set the value’s SRID to match the column). * **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed. * **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required. +### Supported SRIDs + +Spark includes a pre-built registry of standard Spatial Reference Identifiers (SRIDs) from the PROJ database, with overrides to support OGC standards. This registry enables validation and proper handling of coordinate systems for geospatial data.
+ +#### Commonly Used SRIDs + +| SRID | Name | Description | Typical Use Case | +|------|------|-------------|------------------| +| 4326 | WGS 84 | World Geodetic System 1984 (latitude/longitude) | GPS coordinates, global data (default for GEOGRAPHY) | +| 3857 | Web Mercator | Pseudo-Mercator projection used by web mapping services | Web maps (Google Maps, OpenStreetMap, Bing Maps) | +| 2154 | RGF93 / Lambert-93 | French national coordinate system | France-specific mapping and GIS | +| 32633 | WGS 84 / UTM zone 33N | Universal Transverse Mercator, zone 33 North | Central Europe (6°E to 12°E) | +| 32634 | WGS 84 / UTM zone 34N | Universal Transverse Mercator, zone 34 North | Eastern Europe (12°E to 18°E) | +| 32635 | WGS 84 / UTM zone 35N | Universal Transverse Mercator, zone 35 North | Eastern Europe/Western Asia (18°E to 24°E) | + +The registry includes many additional SRIDs for various UTM zones, national coordinate systems, and other projections. For a complete list, refer to the [EPSG Geodetic Parameter Dataset](https://epsg.org/).
Review Comment:
The registry also includes **ESRI** entries (e.g., `ESRI:102100`), not just EPSG. And it's pinned to **PROJ 9.7.1**, not synced live with EPSG. The link to epsg.org could be misleading since users may find SRIDs there that aren't in Spark's registry, or miss ESRI SRIDs that are. Consider referencing the actual registry CSV or at least mentioning the PROJ version and ESRI inclusion.
