tobixdev opened a new pull request, #18552:
URL: https://github.com/apache/datafusion/pull/18552

   ## Which issue does this PR close?
   
   This is a draft for #18223. The APIs should not be considered final (e.g., 
options are missing in the pretty printer). 
   The primary purpose for now is to spark discussion.
   
   Happy to hear any input!
   
   ## Rationale for this change
   
   How cool would it be to just state that you should properly format my 
byte-encoded uuids? :)
   
   ## What changes are included in this PR?
   
   - Implements the `LogicalType` trait for some canonical extension types from 
arrow.
   - Defines `UnresolvedExtensionType`, a "DataFusion canonical extension type" 
that can be used to create a `LogicalType` instance even without a registry. 
The creation functions for `DFSchema` could make use of this type, assuming 
that `DFSchema` should have access to logical types. Furthermore, these 
functions could directly instantiate the canonical arrow extension types, as 
they are known to the system. The functions could then resolve native and 
canonical extension types without access to the registry and "delay" the 
resolution of custom extension types. The idea is that a later "Type Resolver 
Pass" with access to a registry replaces all instances of this type with the 
actual one. While I hope that this is only a temporary solution until all 
places have access to a logical type registry, I think this has the potential 
to become a "permanent temporary solution". With this in mind, we could also 
consider making this explicit with an enum rather than hiding it behind 
dynamic dispatch.
   - Defines an incomplete `ValuePrettyPrinter` for showcasing the UUID pretty 
printing.
   - Plumbing for having `ExtensionTypeRegistry` in `SessionState`
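   
   To make the "Type Resolver Pass" idea more concrete, here is a minimal, 
standalone sketch of the enum variant mentioned above. It is not the code in 
this PR: `TypeRef`, `Registry`, and `resolve_types` are hypothetical names, and 
logical types are reduced to plain strings to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a logical type reference stored in a schema.
#[derive(Debug, Clone, PartialEq)]
pub enum TypeRef {
    /// A native or canonical type that is known without a registry.
    Resolved(String),
    /// A custom extension type, known only by name until a registry is available.
    Unresolved { extension_name: String },
}

/// Hypothetical registry mapping extension names to logical type names.
pub struct Registry {
    types: HashMap<String, String>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { types: HashMap::new() }
    }

    pub fn register(&mut self, extension_name: &str, logical_type: &str) {
        self.types
            .insert(extension_name.to_string(), logical_type.to_string());
    }
}

/// The resolver pass: replaces every `Unresolved` entry with the type found
/// in the registry and fails on names the registry does not know.
pub fn resolve_types(fields: Vec<TypeRef>, registry: &Registry) -> Result<Vec<TypeRef>, String> {
    fields
        .into_iter()
        .map(|type_ref| match type_ref {
            TypeRef::Unresolved { extension_name } => registry
                .types
                .get(&extension_name)
                .map(|resolved| TypeRef::Resolved(resolved.clone()))
                .ok_or_else(|| format!("unknown extension type: {extension_name}")),
            // Already resolved types pass through unchanged.
            resolved => Ok(resolved),
        })
        .collect()
}
```

   With an enum, the unresolved state is visible in the type itself, so a 
caller that forgets to run the pass has to handle the `Unresolved` variant 
explicitly instead of discovering it behind a trait object.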
   
   What is also important is what is *not* included: an integrative example of 
making use of the pretty printer. I tried several avenues but, as you can 
imagine, each change to the core data structure is a huge plumbing effort 
(hopefully reduced by the existence of `UnresolvedLogicalType`).
   
   I really like the suggestion by @paleolimbot to use pretty-printing record 
batches as the first use case. You can see a mini example in the test that 
pretty-prints UUIDs. The nice thing is that this probably would not require 
much plumbing as the `DataFrame` already has access to the `SessionState`. The 
only thing that's missing for me to actually include this example here is that 
`arrow-rs` does not currently support passing custom pretty printers in 
`pretty_format_batches_with_options`.
   
   Imagine that the `to_string` function in the `DataFrame` does the following:
   1. Look up any extension type information from the schema (in a future world 
this would already be part of the schema and another lookup is not necessary)
   2. Gather the pretty printers
   3. Pass the pretty printers to arrow-rs for formatting.
   
   If you think this is a worthwhile pursuit we could add the capability to 
arrow-rs.
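   
   The three steps above could look roughly like the following standalone 
sketch. It deliberately avoids the arrow-rs and DataFusion APIs, since the 
required hook does not exist yet: `Field`, `PrettyPrinter`, `gather_printers`, 
and `format_row` are hypothetical stand-ins, and values are modelled as raw 
byte vectors. The only things taken from Arrow are the 
`ARROW:extension:name` metadata key and the canonical `arrow.uuid` name.

```rust
use std::collections::HashMap;

/// Hypothetical printer signature: turns one raw value into display text.
pub type PrettyPrinter = fn(&[u8]) -> String;

/// Metadata key under which Arrow stores the extension type name of a field.
pub const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical simplified field: a name plus Arrow-style metadata.
pub struct Field {
    pub name: String,
    pub metadata: HashMap<String, String>,
}

/// Fallback printer: plain lowercase hex.
pub fn format_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Formats 16 bytes in the usual 8-4-4-4-12 UUID layout; any other length
/// falls back to plain hex.
pub fn format_uuid(bytes: &[u8]) -> String {
    let hex = format_hex(bytes);
    if bytes.len() != 16 {
        return hex;
    }
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}

/// Steps 1 and 2: look up the extension name of each field and pick the
/// matching printer, defaulting to hex when none is registered.
pub fn gather_printers(
    fields: &[Field],
    registry: &HashMap<String, PrettyPrinter>,
) -> Vec<PrettyPrinter> {
    fields
        .iter()
        .map(|field| {
            field
                .metadata
                .get(EXTENSION_NAME_KEY)
                .and_then(|name| registry.get(name))
                .copied()
                .unwrap_or(format_hex as PrettyPrinter)
        })
        .collect()
}

/// Step 3: hand the printers to the formatter (here a trivial per-row
/// formatter standing in for arrow-rs).
pub fn format_row(row: &[Vec<u8>], printers: &[PrettyPrinter]) -> Vec<String> {
    row.iter()
        .zip(printers)
        .map(|(value, printer)| printer(value))
        .collect()
}
```

   In a real integration, step 3 would be the part that needs the new 
capability in arrow-rs; steps 1 and 2 only depend on the schema and the 
registry held by the session.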
   
   ## Are these changes tested?
   
   Not really, as there is no integrative example yet.
   
   ## Are there any user-facing changes?
   
   There would be.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

