paleolimbot commented on code in PR #20312:
URL: https://github.com/apache/datafusion/pull/20312#discussion_r2799655997


##########
datafusion/expr/src/registry.rs:
##########
@@ -215,3 +218,266 @@ impl FunctionRegistry for MemoryFunctionRegistry {
         self.udwfs.keys().cloned().collect()
     }
 }
+
+/// A cheaply cloneable pointer to an [ExtensionTypeRegistration].
+pub type ExtensionTypeRegistrationRef = Arc<dyn ExtensionTypeRegistration>;
+
+/// The registration of an extension type. Implementations of this trait are 
responsible for
+/// *creating* instances of [`DFExtensionType`] that represent the entire 
semantics of an extension
+/// type.
+///
+/// # Why do we need a Registration?
+///
+/// A good question is why this trait is even necessary. Why not directly 
register the
+/// [`DFExtensionType`] in a registration?

Review Comment:
   Thank you for this write up! I think the way you've done it here is 
great...in the context of the existing `ExtensionType` trait, the registration 
is basically `Self` (i.e., `Self::try_from_field(...)`) and the 
`DFExtensionType` is basically `self` (i.e., 
`self.do_stuff_with_a_data_type()`).
   
   Arrow C++ does register an instance of the `DFExtensionType` equivalent 
instead of having a separate registration class, where every instance of an 
extension type has a `.Deserialize()` method that can create a new instance of 
itself; however, I think the way you've done it here is much cleaner.



##########
datafusion/common/src/types/extension.rs:
##########
@@ -0,0 +1,71 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::error::Result;
+use arrow::array::Array;
+use arrow::util::display::{ArrayFormatter, FormatOptions};
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// A cheaply cloneable pointer to a [`DFExtensionType`].
+pub type DFExtensionTypeRef = Arc<dyn DFExtensionType>;
+
+/// Represents an implementation of a DataFusion extension type.
+///
+/// This allows users to customize the behavior of DataFusion for certain 
types. Having this ability
+/// is necessary because extension types affect how columns should be treated 
by the query engine.
+/// This effect includes which operations are possible on a column and what 
are the expected results
+/// from these operations. The extension type mechanism allows users to define 
how these operations
+/// apply to a particular extension type.
+///
+/// For example, adding two values of 
[`Int64`](arrow::datatypes::DataType::Int64) is a sensible
+/// thing to do. However, if the same column is annotated with an extension 
type like `custom.id`,
+/// the correct interpretation of a column changes. Adding together two 
`custom.id` values, even
+/// though they are stored as integers, may no longer make sense.
+///
+/// Note that DataFusion's extension type support is still young and therefore 
might not cover all
+/// relevant use cases. Currently, the following operations can be customized:
+/// - Pretty-printing values in record batches
+///
+/// # Relation to Arrow's `ExtensionType`
+///
+/// The purpose of Arrow's `ExtensionType` trait, for the time being, is to 
allow reading and
+/// writing extension type metadata in a type-safe manner. The trait does not 
provide any
+/// customization options. Therefore, downstream users (such as DataFusion) 
have the flexibility to
+/// implement the extension type mechanism according to their needs. 
[`DFExtensionType`] is
+/// DataFusion's implementation of this extension type mechanism.
+///
+/// Furthermore, the current trait in arrow-rs is not dyn-compatible, which we 
need for implementing
+/// extension type registries. In the future, the two implementations may 
increasingly converge.
+///
+/// # Examples
+///
+/// Examples for using the extension type machinery can be found in the 
DataFusion examples
+/// directory.
+pub trait DFExtensionType: Debug + Send + Sync {
+    /// Returns an [`ArrayFormatter`] that can format values of this type.
+    ///
+    /// If `Ok(None)` is returned, the default implementation will be used.
+    /// If an error is returned, there was an error creating the formatter.
+    fn create_array_formatter<'fmt>(
+        &self,
+        _array: &'fmt dyn Array,
+        _options: &FormatOptions<'fmt>,
+    ) -> Result<Option<ArrayFormatter<'fmt>>> {
+        Ok(None)
+    }
+}

Review Comment:
   Just highlighting for readers that this is the crux: there is now a trait 
that centralizes the properties of a non-built-in-Arrow DataType that 
DataFusion internals have access to. Pretty printing is a relatively 
straightforward initial target that is particularly useful for the CLI and 
testing; however, there are more things that can be added here (or not):
   
   - Casting extension arrays (this would unlock things like 
`'00010203-0405-0607-0809-000102030506'::UUID` or for me, `'POINT (0 
1)'::GEOMETRY`). The CSV writer could also leverage this by casting extension 
types to string. This would probably also be sufficient to make 
`some_uuid_column = '00010203-0405-0607-0809-000102030506'` work as well 
because I think `=` currently works by casting both sides to a common type.
   - Sort order (I think this is what got Tobias started on all of this and it 
also benefits geometry)



##########
datafusion/core/tests/extension_types/pretty_printing.rs:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{FixedSizeBinaryArray, RecordBatch};
+use arrow_schema::extension::Uuid;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use datafusion::dataframe::DataFrame;
+use datafusion::error::Result;
+use datafusion::execution::SessionStateBuilder;
+use datafusion::prelude::SessionContext;
+use insta::assert_snapshot;
+use std::sync::Arc;
+
+fn test_schema() -> SchemaRef {
+    Arc::new(Schema::new(vec![uuid_field()]))
+}
+
+fn uuid_field() -> Field {
+    Field::new("my_uuids", DataType::FixedSizeBinary(16), 
false).with_extension_type(Uuid)
+}
+
+async fn create_test_table() -> Result<DataFrame> {
+    let schema = test_schema();
+
+    // define data.
+    let batch = RecordBatch::try_new(
+        schema,
+        vec![Arc::new(FixedSizeBinaryArray::from(vec![
+            &[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+            &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 5, 6],
+        ]))],
+    )?;
+
+    let state = SessionStateBuilder::default()
+        .with_canonical_extension_types()?
+        .build();
+    let ctx = SessionContext::new_with_state(state);
+
+    ctx.register_batch("test", batch)?;
+
+    ctx.table("test").await
+}
+
+#[tokio::test]
+async fn test_pretty_print_logical_plan() -> Result<()> {
+    let result = create_test_table().await?.to_string().await?;
+
+    assert_snapshot!(
+        result,
+        @r"
+    +--------------------------------------+
+    | my_uuids                             |
+    +--------------------------------------+
+    | 00000000-0000-0000-0000-000000000000 |
+    | 00010203-0405-0607-0809-000102030506 |
+    +--------------------------------------+
+    "
+    );

Review Comment:
   🎉 !
   
   This is actually one of the things I am looking forward the most with pretty 
printing...it means that `assert_batches_eq!()` will work out of the box and 
some of the testing workarounds we've been using can collapse to nicely 
readable tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to