paleolimbot commented on code in PR #240:
URL: https://github.com/apache/parquet-format/pull/240#discussion_r1800409140
##########
src/main/thrift/parquet.thrift:
##########
@@ -380,6 +410,38 @@ struct JsonType {
struct BsonType {
}
+/** Physical type and encoding for the geometry type */
+enum GeometryEncoding {
+ /**
+ * Allowed for physical type: BYTE_ARRAY.
+ *
+ * Well-known binary (WKB) representations of geometries.
+ */
+ WKB = 0;
+}
+
+/** Interpretation for edges of elements of a GEOMETRY type */
+enum Edges {
+ PLANAR = 0;
+ SPHERICAL = 1;
+}
+
+/**
+ * GEOMETRY logical type annotation (added in 2.11.0)
+ *
+ * GeometryEncoding and Edges are required. CRS is optional.
+ *
+ * Once CRS is set, it MUST be a key to an entry in the `key_value_metadata`
+ * field of `FileMetaData`.
Review Comment:
A string property of "Coordinate reference system identifier" (with a
convention, either within this spec or outside it, of where in the file to look
for the full definition) would give geospatial libraries enough detail to
leverage Parquet.
The need for embedding a full CRS description somewhere that is
programmatically accessible by a Parquet implementation is to ensure a
producer's intent can be faithfully transported to the consumer. In the C++
implementation we can attach this as extension type metadata that can pass
through a pipeline to a consumer that does not have access to the original
context (e.g., constructing a GeoPandas GeoDataFrame from a Parquet file that
was read and filtered using a non-spatial tool like pyarrow). If that needs to
be an external convention (e.g., one that we define in GeoParquet) to get
consensus here, that is OK (even though I think having that convention in the
Parquet specification itself would result in less misinterpreted data).
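To make the lookup convention from the diff concrete (the CRS value acting as a key into `key_value_metadata`), a consumer-side sketch might look like the following. The function name `resolve_crs`, the key `my_crs`, and the use of a plain dict in place of the file metadata are all assumptions for illustration; none of this is an API of any Parquet implementation.

```python
from typing import Mapping, Optional


def resolve_crs(crs_key: Optional[str],
                key_value_metadata: Mapping[str, str]) -> Optional[str]:
    """Return the full CRS definition stored under crs_key.

    Hypothetical helper: key_value_metadata stands in for the
    key/value pairs of FileMetaData, modeled here as a plain mapping.
    """
    if crs_key is None:
        # CRS is optional, so an unset value is not an error.
        return None
    if crs_key not in key_value_metadata:
        # Under the proposed wording, a set CRS must point at an entry.
        raise ValueError(
            f"CRS key {crs_key!r} has no entry in key_value_metadata"
        )
    return key_value_metadata[crs_key]


# Illustrative metadata; the key name and the definition are made up.
file_metadata = {"my_crs": '{"type": "GeographicCRS", "name": "WGS 84"}'}
definition = resolve_crs("my_crs", file_metadata)
```

A consumer that only forwards data (for example, a non-spatial filter step) would need to carry both the key and the matching metadata entry for the definition to survive the pipeline.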
Alternatively, would removing any conventions or requirements around the
`string crs` be acceptable? That is, the producer puts whatever it needs there
to ensure that the coordinates in this column are not misinterpreted by the
consumer, which may be an identifier or a full CRS definition, depending on the
producer's requirements.
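If the `string crs` were left free-form as suggested above, a consumer would have to decide for itself what it was given. The sketch below shows one possible heuristic, assuming that a full definition arrives as a JSON object and that anything else is an opaque identifier or definition string. The function name `classify_crs` and the two category labels are hypothetical, and a real consumer would likely hand the string to a CRS library rather than classify it by hand.

```python
import json
from typing import Any, Tuple


def classify_crs(crs: str) -> Tuple[str, Any]:
    """Classify a free-form CRS string (hypothetical heuristic).

    Returns ("json", parsed_dict) when the string is a JSON object,
    otherwise ("opaque", original_string).
    """
    try:
        parsed = json.loads(crs)
    except json.JSONDecodeError:
        # Not JSON at all, e.g. an authority:code identifier.
        return ("opaque", crs)
    if isinstance(parsed, dict):
        return ("json", parsed)
    # Valid JSON but not an object (e.g. a bare number): keep the raw text.
    return ("opaque", crs)


kind_id, value_id = classify_crs("EPSG:4326")
kind_def, value_def = classify_crs('{"type": "GeographicCRS", "name": "WGS 84"}')
```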
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]