yukkit opened a new issue, #7923:
URL: https://github.com/apache/arrow-datafusion/issues/7923

   ### Is your feature request related to a problem or challenge?
   
   Several open issues ask for a way to add extension types in DataFusion:
   
   - https://github.com/apache/arrow-datafusion/issues/7859
   - https://github.com/apache/arrow-datafusion/issues/7845
   - https://github.com/apache/arrow-rs/issues/4472
   
   An interface for adding extension types in DataFusion would be valuable: it 
would let applications built on DataFusion incorporate business-specific data 
types easily.
   
   I hope this proposal helps move the UDT feature forward.
   
   ### Describe the solution you'd like
   
   # User-Defined Types (UDT)
   
   UDT stands for User-Defined Type. It is a feature in database systems that 
allows users to define their own custom data types based on existing data types 
provided by the database. This feature enables users to create data structures 
tailored to their specific needs, providing a higher level of abstraction and 
organization for complex data.
   
   ## Syntax
   
   ```sql
   <user-defined type definition> ::=
     CREATE TYPE <user-defined type name> AS <representation>
   
   <representation> ::=
     <predefined type>
   | <member list>
     
   <member list> ::=
     <left paren> <member> [ { <comma> <member> }... ] <right paren>
     
   <member> ::=
     <attribute name> <data type>
   
   <attribute name> ::=
     <identifier>
   ```
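   
   As a rough illustration of the grammar above, the statement could be held 
in an AST along these lines. This is only a sketch: `CreateType`, 
`Representation` and `Member` are hypothetical names, not existing sqlparser or 
DataFusion types, and data types are kept as plain strings for brevity.
   
   ```rust
   use std::fmt;

   /// One `<member>`: an attribute name paired with a data type name.
   #[derive(Debug, Clone, PartialEq)]
   pub struct Member {
       pub name: String,
       pub data_type: String,
   }

   /// `<representation>`: a predefined type or a member list.
   #[derive(Debug, Clone, PartialEq)]
   pub enum Representation {
       Predefined(String),
       Members(Vec<Member>),
   }

   /// `CREATE TYPE <user-defined type name> AS <representation>`.
   #[derive(Debug, Clone, PartialEq)]
   pub struct CreateType {
       pub name: String,
       pub representation: Representation,
   }

   impl fmt::Display for CreateType {
       /// Renders the node back into the SQL form given by the grammar.
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           write!(f, "CREATE TYPE {} AS ", self.name)?;
           match &self.representation {
               Representation::Predefined(ty) => write!(f, "{}", ty),
               Representation::Members(members) => {
                   let parts: Vec<String> = members
                       .iter()
                       .map(|m| format!("{} {}", m.name, m.data_type))
                       .collect();
                   write!(f, "({})", parts.join(", "))
               }
           }
       }
   }
   ```
   
   Formatting a node with `to_string()` yields the statement text again, which 
is a cheap way to check that the AST covers both branches of 
`<representation>`.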
   
   ## Behaviors
   
   ### Behaviors of Data Types
   1. Type matching assessment.
   2. Computation of the common super type for two types.
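   
   One possible reading of these two behaviors is sketched below. Everything 
here is an assumption made for illustration: `Physical` is a tiny stand-in for 
arrow's `DataType`, `Logical` is a hypothetical logical type, and the policy 
(UDTs match by name; mixed operands fall back to the physical types and lose 
the UDT tag) is just one option, not something the proposal fixes.
   
   ```rust
   /// Hypothetical stand-in for a physical (arrow) data type.
   /// Variant order encodes numeric widening: Int32 < Int64 < Float64.
   #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
   pub enum Physical {
       Int32,
       Int64,
       Float64,
       Utf8,
   }

   /// Hypothetical logical type: built in, or a named UDT.
   #[derive(Debug, Clone, PartialEq, Eq)]
   pub enum Logical {
       Builtin(Physical),
       Udt { name: String, physical: Physical },
   }

   impl Logical {
       /// Physical type that backs this logical type.
       pub fn physical(&self) -> Physical {
           match self {
               Logical::Builtin(p) => *p,
               Logical::Udt { physical, .. } => *physical,
           }
       }
   }

   /// Type matching: a UDT only matches a UDT with the same name.
   pub fn type_matches(a: &Logical, b: &Logical) -> bool {
       a == b
   }

   /// Common super type (assumed policy): identical types keep their
   /// identity; otherwise widen the physical numeric types.
   pub fn common_super_type(a: &Logical, b: &Logical) -> Option<Logical> {
       if type_matches(a, b) {
           return Some(a.clone());
       }
       let (pa, pb) = (a.physical(), b.physical());
       if pa == pb {
           return Some(Logical::Builtin(pa));
       }
       if pa == Physical::Utf8 || pb == Physical::Utf8 {
           return None; // no implicit string/number super type here
       }
       Some(Logical::Builtin(pa.max(pb)))
   }
   ```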
   
   ### Behaviors of Data
   
   1. Inference of data type from literal value.
   2. Casting literal value to other type.
   3. Casting variable value to other type.
   4. **Import and export of data.** (Sensitive to logical data types)
   5. Operations like data comparison, etc.
   
   ## Role of Data Types in the SQL Lifecycle
   
   ### SQL Statement String -> AST
   
   None
   
   ### AST -> Logical Plan
   
   1. Create Type
      - Parsing, constructing, and storing the description of UDT.
   2. Create Table (Using Type)
      - Parsing data types
        * Built-in types
        * **User-defined types**
      - Constructing DFField (using metadata field to tag extended types), 
storing metadata.
   3. Query
      - How to construct extended data types?
        * **Use the STRUCT function**.
        * Use UDF.
      - How to perform relational (comparison), logical, and arithmetic 
operations with other data types? How to perform type conversion?
        * Constant to UDT
          1. Use arrow conversion rules.
        * Variable to UDT
          1. Determine whether the cast is allowed under arrow rules, and add 
a cast expression as needed.
        * UDT to other data types
          1. Determine whether the cast is allowed under arrow rules, and add 
a cast expression as needed.
        > e.g. Any binary to UUID (DataType::FixedSizeBinary(16)): if the data 
layout is the same but the content format differs, the conversion is not 
possible. As I understand it, though, a UDT concerns only the data type, not 
the data content, so this is not a problem.
      - Hashing, sorting?
        * Use arrow DataType.
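   
   The "check the cast, then add a cast expression as needed" step could look 
roughly like the following. It is a sketch under stated assumptions: `Physical` 
and `Expr` are simplified stand-ins for arrow's `DataType` and DataFusion's 
expression type, and `can_cast` is a made-up rule table. In a real 
implementation the check would be delegated to arrow-rs, which exposes 
`can_cast_types` for this purpose.
   
   ```rust
   /// Hypothetical stand-in for the physical type of a UDT.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   pub enum Physical {
       Int64,
       Utf8,
       Binary,
       FixedSizeBinary(i32),
   }

   /// Hypothetical, minimal expression tree.
   #[derive(Debug, Clone, PartialEq)]
   pub enum Expr {
       Column(String),
       Cast { expr: Box<Expr>, to: Physical },
   }

   /// Made-up cast table standing in for arrow's cast rules.
   fn can_cast(from: Physical, to: Physical) -> bool {
       use Physical::*;
       match (from, to) {
           (a, b) if a == b => true,
           (Int64, Utf8) | (Utf8, Int64) => true,
           (Binary, FixedSizeBinary(_)) => true,
           (FixedSizeBinary(_), Binary) => true,
           _ => false,
       }
   }

   /// Wraps `expr` in a cast only when the physical types differ,
   /// and rejects the plan when the rules do not allow the cast.
   pub fn coerce(
       expr: Expr,
       from: Physical,
       to: Physical,
   ) -> Result<Expr, String> {
       if from == to {
           return Ok(expr);
       }
       if can_cast(from, to) {
           Ok(Expr::Cast { expr: Box::new(expr), to })
       } else {
           Err(format!("cannot cast {:?} to {:?}", from, to))
       }
   }
   ```
   
   Because only physical types are compared, a binary column passes the check 
for a UUID-like `FixedSizeBinary(16)` target regardless of how its bytes are 
formatted, which matches the note above that a UDT concerns the type and not 
the content.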
   
   ### Logical Plan -> Execution Plan
   
   None
   
   ### Execution Plan -> ResultSet
   
   1. Cast
      * Execute according to arrow DataType's cast logic.
   2. Comparison, operations, etc.
      * Execute according to arrow DataType's logic.
   3. TableScan/TableWrite
      * **Identify extended types through Field metadata, thus performing 
special serialization or deserialization**.
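   
   A minimal sketch of how a scan or write path might pick a codec from Field 
metadata follows. `Field` and `Codec` are hypothetical stand-ins defined only 
for this example; the key `ARROW:extension:name` is the one used in the 
geoarrow example later in this proposal.
   
   ```rust
   use std::collections::HashMap;

   /// Metadata key that names an extension type on a field.
   pub const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

   /// Hypothetical, minimal stand-in for a schema field.
   pub struct Field {
       pub name: String,
       pub metadata: HashMap<String, String>,
   }

   /// How TableScan/TableWrite should treat a column.
   #[derive(Debug, PartialEq)]
   pub enum Codec<'a> {
       /// No tag: use the default handling of the arrow type.
       Plain,
       /// Tagged: use the serializer registered under this name.
       Extension(&'a str),
   }

   /// Chooses a codec by looking only at the field metadata.
   pub fn codec_for(field: &Field) -> Codec<'_> {
       match field.metadata.get(EXTENSION_NAME_KEY) {
           Some(name) => Codec::Extension(name.as_str()),
           None => Codec::Plain,
       }
   }
   ```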
   
   ## Core Structures
   
   ```rust
   use std::borrow::Cow;
   use std::collections::HashMap;
   use std::sync::Arc;

   use arrow::datatypes::DataType;
   use datafusion_common::Result;

   /// UDT Signature
   /// <udt_name>[ (<param>[ {,<param>}... ]) ]
   pub struct TypeSignature<'a> {
     name: Cow<'a, str>,
     params: Vec<Cow<'a, str>>,
   }
   
   /// UDT Entity
   pub struct UserDefinedType {
     // An owned signature, so the entity carries no borrowed data.
     signature: TypeSignature<'static>,
     physical_type: DataType,
   }
   
   impl UserDefinedType {
     /// Physical data type
     pub fn arrow_type(&self) -> DataType;
     /// Metadata used to tag extended data types
     pub fn metadata(&self) -> HashMap<String, String>;
   }
   
   pub trait ContextProvider { 
     /// Get UDT description by signature
     fn udt(&self, type_signature: TypeSignature<'_>)
       -> Result<Arc<UserDefinedType>>;
     
     // ......
   }
   ```
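   
   To make the lookup concrete, here is a small in-memory registry that could 
back a `udt` method like the one above. It is a sketch, not the proposed 
implementation: `TypeRegistry`, `register` and `drop_type` are hypothetical 
names, `DataType` is a two-variant stand-in for arrow's type, the signature 
uses owned `String`s instead of `Cow`, and errors are plain strings.
   
   ```rust
   use std::collections::HashMap;
   use std::sync::Arc;

   /// Hypothetical stand-in for arrow's `DataType`.
   #[derive(Debug, Clone, PartialEq)]
   pub enum DataType {
       Int64,
       Utf8,
   }

   /// Owned variant of the UDT signature, usable as a map key.
   #[derive(Debug, Clone, PartialEq, Eq, Hash)]
   pub struct TypeSignature {
       pub name: String,
       pub params: Vec<String>,
   }

   #[derive(Debug, PartialEq)]
   pub struct UserDefinedType {
       pub signature: TypeSignature,
       pub physical_type: DataType,
   }

   /// Hypothetical catalog of types created by CREATE TYPE.
   #[derive(Default)]
   pub struct TypeRegistry {
       types: HashMap<TypeSignature, Arc<UserDefinedType>>,
   }

   impl TypeRegistry {
       /// CREATE TYPE: fails if the signature is already taken.
       pub fn register(
           &mut self,
           udt: UserDefinedType,
       ) -> Result<(), String> {
           if self.types.contains_key(&udt.signature) {
               let name = &udt.signature.name;
               return Err(format!("type {} already exists", name));
           }
           self.types.insert(udt.signature.clone(), Arc::new(udt));
           Ok(())
       }

       /// Lookup by signature, as `ContextProvider::udt` would do.
       pub fn udt(
           &self,
           signature: &TypeSignature,
       ) -> Result<Arc<UserDefinedType>, String> {
           self.types
               .get(signature)
               .cloned()
               .ok_or_else(|| format!("unknown type {}", signature.name))
       }

       /// DROP TYPE: returns whether the type existed.
       pub fn drop_type(&mut self, signature: &TypeSignature) -> bool {
           self.types.remove(signature).is_some()
       }
   }
   ```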
   
   ## Examples
   
   ### Create UDT
   
   ```sql
   CREATE TYPE user_id_t AS BIGINT;
   CREATE TYPE email_t AS String;
   CREATE TYPE person_t AS (
     user_id user_id_t,
     first_name String,
     last_name String,
     age INTEGER,
     email email_t);
   
   DROP TYPE person_t;
   DROP TYPE email_t;
   DROP TYPE user_id_t;
   ```
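   
   The statements above imply that a composite type such as `person_t` 
resolves, member by member, to a physical struct. A rough sketch of that 
resolution is shown below. The names `Definition`, `Physical` and `resolve` 
are hypothetical, the mapping of `BIGINT`, `INTEGER` and `String` to physical 
types is an assumption, and definitions are assumed to be acyclic (a type can 
only refer to types created before it).
   
   ```rust
   use std::collections::HashMap;

   /// Hypothetical stand-in for arrow's `DataType`.
   #[derive(Debug, Clone, PartialEq)]
   pub enum Physical {
       Int32,
       Int64,
       Utf8,
       Struct(Vec<(String, Physical)>),
   }

   /// What a CREATE TYPE statement stored in the catalog.
   pub enum Definition {
       /// `CREATE TYPE t AS <predefined type or other UDT>`
       Alias(String),
       /// `CREATE TYPE t AS (<attribute name> <data type>, ...)`
       Members(Vec<(String, String)>),
   }

   /// Resolves a type name to its physical type, following UDT
   /// references. Returns `None` for unknown names.
   pub fn resolve(
       name: &str,
       catalog: &HashMap<String, Definition>,
   ) -> Option<Physical> {
       // Built-in names first (assumed mapping).
       match name.to_ascii_uppercase().as_str() {
           "BIGINT" => return Some(Physical::Int64),
           "INTEGER" => return Some(Physical::Int32),
           "STRING" => return Some(Physical::Utf8),
           _ => {}
       }
       match catalog.get(name)? {
           Definition::Alias(target) => resolve(target, catalog),
           Definition::Members(members) => members
               .iter()
               .map(|(attr, ty)| {
                   Some((attr.clone(), resolve(ty, catalog)?))
               })
               .collect::<Option<Vec<_>>>()
               .map(Physical::Struct),
       }
   }
   ```
   
   This also shows why the `DROP TYPE` statements run in reverse order: 
dropping `email_t` first would leave `person_t` unresolvable.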
   
   ### geoarrow
   
   https://github.com/geoarrow/geoarrow/blob/main/extension-types.md
   
   #### Point
   
   ```
   type_signature: Geometry(Point)
   arrow_type: DataType::FixedSizeList(xy, 2)
   metadata: { "ARROW:extension:name": "geoarrow.point" }
   ```
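   
   A signature string such as `Geometry(Point)` follows the 
`<udt_name>[ (<param>[ {,<param>}... ]) ]` form from Core Structures. A minimal 
parser for that form might look like this; `parse_signature` is a hypothetical 
helper, the struct uses owned `String`s for simplicity, and nested parentheses 
inside parameters are not handled.
   
   ```rust
   /// Owned variant of the UDT signature.
   #[derive(Debug, Clone, PartialEq, Eq)]
   pub struct TypeSignature {
       pub name: String,
       pub params: Vec<String>,
   }

   /// Parses `<udt_name>[ (<param>[ {,<param>}... ]) ]`.
   pub fn parse_signature(input: &str) -> Result<TypeSignature, String> {
       let input = input.trim();
       let (name, params) = match input.find('(') {
           // No parameter list: the whole input is the name.
           None => (input, Vec::new()),
           Some(open) => {
               let inner = input[open + 1..]
                   .strip_suffix(')')
                   .ok_or_else(|| format!("missing ')' in {:?}", input))?;
               let params: Vec<String> = inner
                   .split(',')
                   .map(|p| p.trim().to_string())
                   .collect();
               // The grammar requires at least one non-empty param.
               if params.iter().any(|p| p.is_empty()) {
                   return Err(format!("empty param in {:?}", input));
               }
               (input[..open].trim_end(), params)
           }
       };
       if name.is_empty() {
           return Err("empty type name".to_string());
       }
       Ok(TypeSignature { name: name.to_string(), params })
   }
   ```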
   
   ## Questions
   
   1. Are UDFs sensitive to extended types? For example, when extended type 
data is encoded as binary, the type tag exists only in Field metadata and 
cannot be obtained at UDF runtime.
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   @alamb I would particularly welcome your feedback or suggestions on this 
proposal. I also encourage anyone familiar with or interested in this feature 
to contribute ideas for improving it.

