scovich commented on code in PR #8365:
URL: https://github.com/apache/arrow-rs/pull/8365#discussion_r2355594747
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -24,12 +24,54 @@ use arrow::datatypes::{
Float16Type, Float32Type, Float64Type, Int16Type, Int32Type, Int64Type, Int8Type,
UInt16Type, UInt32Type, UInt64Type, UInt8Type,
};
+use arrow_schema::extension::ExtensionType;
use arrow_schema::{ArrowError, DataType, Field, FieldRef, Fields};
use parquet_variant::Uuid;
use parquet_variant::Variant;
use std::any::Any;
use std::sync::Arc;
+/// Variant Canonical Extension Type
+pub struct VariantType;
+
+impl ExtensionType for VariantType {
+ const NAME: &'static str = "parquet.variant";
+
+ // Variants have no extension metadata
+ type Metadata = ();
+
+ fn metadata(&self) -> &Self::Metadata {
+ &()
+ }
+
+ fn serialize_metadata(&self) -> Option<String> {
+ None
+ }
+
+ fn deserialize_metadata(_metadata: Option<&str>) -> Result<Self::Metadata, ArrowError> {
+ Ok(())
+ }
+
+ fn supports_data_type(&self, data_type: &DataType) -> Result<(), ArrowError> {
+ // Note don't check for metadata/value fields here because they may be
+ // absent in shredded variants
+ if matches!(data_type, DataType::Struct(_)) {
+ Ok(())
Review Comment:
One thing I thought of while working on shredding, variant_get, etc. -- what
if we supported metadata that indicates the desired handling of shredded
columns? For example (naming TBD), a caller could annotate a binary variant
type with "keep-shredding" to indicate that reads and variant_get calls should
preserve the existing shredding structure of the underlying data, or "unshred"
to indicate that such calls should return binary variant even if the underlying
data are shredded. The output data would not carry any annotation, because it
is what it is.
Such an approach would solve the problem of how to request partially
shredded results using `variant_get`, for example. Today, we can get that
behavior by passing `as_type=None`, but that doesn't work if I want a nested
type with some fields shredded and others left as-is.
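To make the proposal above concrete, here is a minimal std-only sketch of what such a policy annotation could look like. All names here (`ShreddingPolicy`, `from_metadata`, the `"keep-shredding"`/`"unshred"` strings) are hypothetical placeholders, per the "naming TBD" caveat; nothing like this exists in arrow-rs today.

```rust
/// Hypothetical annotation describing how reads and `variant_get`-style
/// kernels should treat shredded columns. Naming is TBD; this is a sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShreddingPolicy {
    /// Preserve the existing shredding structure of the underlying data.
    KeepShredding,
    /// Return binary variant even if the underlying data are shredded.
    Unshred,
}

impl ShreddingPolicy {
    /// Parse the policy from (hypothetical) extension-type metadata.
    /// `None` means no annotation: the output "is what it is".
    pub fn from_metadata(metadata: Option<&str>) -> Result<Option<Self>, String> {
        match metadata {
            None => Ok(None),
            Some("keep-shredding") => Ok(Some(Self::KeepShredding)),
            Some("unshred") => Ok(Some(Self::Unshred)),
            Some(other) => Err(format!("unknown shredding policy: {other}")),
        }
    }

    /// Serialize the policy back into the metadata string form.
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::KeepShredding => "keep-shredding",
            Self::Unshred => "unshred",
        }
    }
}
```

Under this sketch, a caller requesting partially shredded results would annotate only the fields they want preserved, and the absence of any annotation would keep today's behavior.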
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -246,6 +288,57 @@ impl VariantArray {
pub fn typed_value_field(&self) -> Option<&ArrayRef> {
self.shredding_state.typed_value_field()
}
+
+ // TODO return a field with the correct metadata?
+ pub fn data_type(&self) -> &DataType {
+ self.inner.data_type()
+ }
+
+ pub fn slice(&self, offset: usize, length: usize) -> Self {
+ let inner = self.inner.slice(offset, length);
+ let metadata = self.metadata.slice(offset, length);
+ let shredding_state = self.shredding_state.slice(offset, length);
+ Self {
+ inner,
+ metadata,
+ shredding_state,
+ }
+ }
+
+ pub fn len(&self) -> usize {
+ self.inner.len()
+ }
+
+ pub fn is_empty(&self) -> bool {
+ self.inner.is_empty()
+ }
+
+ pub fn nulls(&self) -> Option<&NullBuffer> {
+ self.inner.nulls()
+ }
+
+ /// Is the element at index null?
+ pub fn is_null(&self, index: usize) -> bool {
+ // TODO: should we account for shredded nulls here?
Review Comment:
Do you mean, return true here when the `value` column holds `Variant::Null`
and `typed_value` is `NULL`?
I think that's more a job for `logical_nulls`?
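One possible reading of the `logical_nulls` suggestion, sketched with plain `Vec<bool>` standing in for arrow's `NullBuffer` (an assumption for illustration; the real arrow-rs types and method names differ):

```rust
/// Illustrative sketch only: a row is logically null when the top-level
/// struct entry is null, or when the shredded `typed_value` is null and the
/// `value` column holds `Variant::Null` for that row.
fn logical_nulls(
    top_level_null: &[bool],
    typed_value_null: &[bool],
    value_is_variant_null: &[bool],
) -> Vec<bool> {
    top_level_null
        .iter()
        .zip(typed_value_null)
        .zip(value_is_variant_null)
        .map(|((&top, &tv), &vn)| top || (tv && vn))
        .collect()
}
```

This keeps `is_null` cheap (physical nulls only) while letting `logical_nulls` fold in the shredding-aware interpretation.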
##########
parquet-variant-compute/src/variant_array.rs:
##########
@@ -617,59 +710,6 @@ fn typed_value_to_variant(typed_value: &ArrayRef, index: usize) -> Variant<'_, '
}
}
-impl Array for VariantArray {
Review Comment:
I think we also need to remove the `impl Array` for the shredded object
field array as well?
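The design direction here (dropping `impl Array` from wrapper types so callers go through the inner array instead) can be illustrated with simplified stand-ins; the real arrow-rs `Array` trait is far larger, and `ShreddedObjectFieldArray` below is a hypothetical name for this sketch:

```rust
// Simplified stand-in for arrow's Array trait, for illustration only.
trait Array {
    fn len(&self) -> usize;
}

struct StructArray {
    rows: usize,
}

impl Array for StructArray {
    fn len(&self) -> usize {
        self.rows
    }
}

/// Wrapper that deliberately does NOT implement `Array`; the generic array
/// API is reached through `inner()`, while the wrapper exposes only the
/// variant-specific surface it wants to support.
struct ShreddedObjectFieldArray {
    inner: StructArray,
}

impl ShreddedObjectFieldArray {
    fn inner(&self) -> &dyn Array {
        &self.inner
    }
    fn len(&self) -> usize {
        self.inner.len()
    }
}
```

Removing the `impl Array` prevents the wrapper from being used where its physical representation (a struct of `metadata`/`value`/`typed_value`) would be mistaken for its logical variant values.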
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]