BlakeOrth commented on code in PR #8801:
URL: https://github.com/apache/arrow-rs/pull/8801#discussion_r2525370481


##########
parquet-geospatial/src/types.rs:
##########
@@ -0,0 +1,210 @@
+use std::sync::Arc;
+
+use arrow::array::{Array, ArrayRef, BinaryArray, BinaryViewArray, 
LargeBinaryArray, make_array};
+use arrow::buffer::NullBuffer;
+use arrow::error::Result;
+use arrow_schema::Field;
+use arrow_schema::{ArrowError, DataType, extension::ExtensionType};
+use serde::{Deserialize, Serialize};
+
+#[derive(Copy, Clone, Debug, Serialize, Deserialize)]
+pub enum Hint {
+    Geometry,
+    Geography,
+}
+
+#[derive(Clone, Debug, Default, Serialize, Deserialize)]
+pub struct Metadata {
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub crs: Option<String>, // TODO: explore when this is valid JSON to avoid 
double escaping
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub algorithm: Option<String>,
+    #[serde(skip_serializing_if = "Option::is_none")]
+    type_hint: Option<Hint>,
+}

Review Comment:
   I want to highlight this change because I think it merits discussion. I've 
introduced a `type_hint` into the metadata for this extension because I believe 
it's necessary to solve a round-trip for a specific ambiguous case.
   
   If my reading of the spec is correct, having a parquet `GEOGRAPHY` logical 
type with no metadata is a valid case (the default values of `crs: OGC:CRS84` 
and `algorithm: SPHERICAL` would apply for readers of the column). If we want 
to support writing this case while preserving the (lack of) input metadata we 
need some additional information to inform us of the underlying type of the 
array data.
   
   This field is obviously not a part of the parquet metadata spec, nor the 
GeoArrow metadata spec. However, the `WkbArray` that has been introduced here 
doesn't claim to be either, so perhaps this is ok. If we want to ensure the 
metadata here strictly follows one spec or the other we will likely have to 
drop back to determining the parquet type of the array heuristically based on 
the `algorithm` field which could result in incorrect logical types for the 
ambiguous case. While I suspect this case is relatively rare, I think it's 
important to note it could happen.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to