BlakeOrth commented on code in PR #8801:
URL: https://github.com/apache/arrow-rs/pull/8801#discussion_r2525370481
##########
parquet-geospatial/src/types.rs:
##########
@@ -0,0 +1,210 @@
+use std::sync::Arc;
+
+use arrow::array::{Array, ArrayRef, BinaryArray, BinaryViewArray,
LargeBinaryArray, make_array};
+use arrow::buffer::NullBuffer;
+use arrow::error::Result;
+use arrow_schema::Field;
+use arrow_schema::{ArrowError, DataType, extension::ExtensionType};
+use serde::{Deserialize, Serialize};
+
+#[derive(Copy, Clone, Debug, Serialize, Deserialize)]
+pub enum Hint {
+ Geometry,
+ Geography,
+}
+
+#[derive(Clone, Debug, Default, Serialize, Deserialize)]
+pub struct Metadata {
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub crs: Option<String>, // TODO: explore when this is valid JSON to avoid
double escaping
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub algorithm: Option<String>,
+ #[serde(skip_serializing_if = "Option::is_none")]
+ type_hint: Option<Hint>,
+}
Review Comment:
I want to highlight this change because I think it merits discussion. I've
introduced a `type_hint` into the metadata for this extension because I believe
it's necessary to solve a round-trip for a specific ambiguous case.
If my reading of the spec is correct, having a parquet `GEOGRAPHY` logical
type with no metadata is a valid case (the default values of `crs: OGC:CRS84`
and `algorithm: SPHERICAL` would apply for readers of the column). If we want
to support writing this case while preserving the (lack of) input metadata we
need some additional information to inform us of the underlying type of the
array data.
This field is obviously not a part of the parquet metadata spec, nor the
GeoArrow metadata spec. However, the `WkbArray` that has been introduced here
doesn't claim to be either, so perhaps this is ok. If we want to ensure the
metadata here strictly follows one spec or the other we will likely have to
drop back to determining the parquet type of the array heuristically based on
the `algorithm` field which could result in incorrect logical types for the
ambiguous case. While I suspect this case is relatively rare, I think it's
important to note it could happen.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]