paleolimbot commented on code in PR #240:
URL: https://github.com/apache/parquet-format/pull/240#discussion_r1800409140


##########
src/main/thrift/parquet.thrift:
##########
@@ -380,6 +410,38 @@ struct JsonType {
 struct BsonType {
 }
 
+/** Physical type and encoding for the geometry type */
+enum GeometryEncoding {
+  /**
+   * Allowed for physical type: BYTE_ARRAY.
+   *
+   * Well-known binary (WKB) representations of geometries.
+   */
+  WKB = 0;
+}
+
+/** Interpretation for edges of elements of a GEOMETRY type */
+enum Edges {
+  PLANAR = 0;
+  SPHERICAL = 1;
+}
+
+/**
+ * GEOMETRY logical type annotation (added in 2.11.0)
+ *
+ * GeometryEncoding and Edges are required. CRS is optional.
+ *
+ * Once CRS is set, it MUST be a key to an entry in the `key_value_metadata`
+ * field of `FileMetaData`.

Review Comment:
   A string property holding a "coordinate reference system identifier" (with a 
convention, either within this spec or outside it, for where in the file to 
find the full definition) would give geospatial libraries enough detail to 
leverage Parquet.
   
   The reason to embed a full CRS description somewhere that is 
programmatically accessible to a Parquet implementation is to ensure that a 
producer's intent is faithfully transported to the consumer. In the C++ 
implementation we can attach this as extension type metadata that can pass 
through a pipeline to a consumer that does not have access to the original 
context (e.g., constructing a GeoPandas GeoDataFrame from a Parquet file that 
was read and filtered using a non-spatial tool like pyarrow). If that needs to 
be an external convention (e.g., one that we define in GeoParquet) to get 
consensus here, that is OK (even though I think having that convention in the 
Parquet specification itself would result in less misinterpreted data).
   
   Alternatively, would it be acceptable to remove any conventions or 
requirements around the `string crs`? That is, the producer puts there whatever 
it needs to ensure that the coordinates in this column are not misinterpreted 
by the consumer; this may be an identifier or a full CRS definition, depending 
on the requirements of the producer.
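   
   To make the two options concrete, here is a minimal sketch of how a consumer 
might resolve the `crs` string. It is not part of the proposed spec: the helper 
name `resolve_crs`, the sample key `my_crs`, and the sample definition are all 
hypothetical, and the fallback to the raw string only illustrates the 
"no convention" alternative (under the strict wording in the diff, a missing 
key would instead be an error).

```python
from typing import Mapping, Optional


def resolve_crs(
    crs: Optional[str], key_value_metadata: Mapping[str, str]
) -> Optional[str]:
    """Return the CRS text a consumer should interpret, or None if unset.

    Hypothetical helper: if `crs` names an entry in the file-level
    key/value metadata, return that entry (the full definition).
    Otherwise return `crs` unchanged and leave its interpretation
    (identifier or inline definition) to the consumer.
    """
    if crs is None:
        return None
    return key_value_metadata.get(crs, crs)


# Assumed sample data, for illustration only.
file_metadata = {"my_crs": '{"type": "GeographicCRS", "name": "WGS 84"}'}

# Option 1: `crs` is a key into the file's key/value metadata.
resolved_from_key = resolve_crs("my_crs", file_metadata)

# Option 2: `crs` is whatever the producer chose to write.
resolved_inline = resolve_crs("EPSG:4326", file_metadata)
```

   Either way, the resolved string is what would need to travel with the column 
(for example as extension type metadata) so that a downstream consumer can 
still interpret the coordinates.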



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


