paleolimbot commented on issue #16116:
URL: https://github.com/apache/datafusion/issues/16116#issuecomment-3073876353

   Just listing a few specific places where I've had to integrate extension 
types outside of existing DataFusion mechanisms:
   
   - The `Signature` (i.e., how do you use the Signature mechanism to match a 
udf on an extension type). One can also just use a kernel that matches 
anything, coerce the types yourself, and error in `return_field()`. If I'm 
remembering correctly, the internal logic to coerce existing types isn't 
exposed in a way that makes it easy to do that.
   - Casting: the `Cast` struct uses `DataType` (and anything that uses 
arrow-rs' casting will too). One can work around this by defining a UDF (e.g., 
`custom_cast(x, to)`) where `to` is a null scalar of the appropriate type.
   - Printing: the out-of-the-box output you'll get is probably not what should 
be printed when you query a Parquet file with a variant column. (Could/should 
re-use the cast to string?)
   - CSV output (Could/should re-use the cast to string?)
   - The Parquet reader outputting the extension field (I assume this is in the 
works/is a parquet crate issue?)
   - Multiple Parquet files/unioning: out of the box, a shredded and an 
unshredded variant will probably fail to concatenate because of the differing 
layout (and, for that matter, two shredded variants probably will too).
   - SQL parsing/unparsing (mostly expressing the type name)
   - Use of statistics to do pruning. I think DataFusion automatically disables 
Parquet pruning when it sees a column reference to a nested type like a struct 
(and its notion of a `Column` doesn't support nesting, and its notion of 
Statistics is somewhat limited), so there are possibly a few battles on this 
one.
   - Probably more!
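
To make the `Signature` and casting workarounds above a bit more concrete, here is a minimal std-only sketch. Everything in it is a toy stand-in: `ToyField` and `custom_cast_return_field` are hypothetical names, not DataFusion or arrow-rs APIs. The only real detail is the `ARROW:extension:name` metadata key that Arrow uses to tag a field with its extension type. The idea is that the function accepts any argument types, reads the target type from the (typed null) second argument, and rejects unsupported combinations when resolving the return field.

```rust
use std::collections::HashMap;

/// Metadata key Arrow uses to tag a field with its extension type name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Toy stand-in for an Arrow field: a storage type name plus metadata.
/// (Hypothetical; this is not the arrow-rs `Field` type.)
#[derive(Debug, Clone, PartialEq)]
pub struct ToyField {
    pub storage: String,
    pub metadata: HashMap<String, String>,
}

impl ToyField {
    /// A field with no extension metadata.
    pub fn plain(storage: &str) -> Self {
        ToyField { storage: storage.to_string(), metadata: HashMap::new() }
    }

    /// A field tagged with an extension type name.
    pub fn extension(storage: &str, name: &str) -> Self {
        let mut field = ToyField::plain(storage);
        field.metadata.insert(EXTENSION_NAME_KEY.to_string(), name.to_string());
        field
    }

    pub fn extension_name(&self) -> Option<&str> {
        self.metadata.get(EXTENSION_NAME_KEY).map(String::as_str)
    }
}

/// Resolve the output field of a hypothetical `custom_cast(x, to)`.
/// Any argument fields are accepted; unsupported combinations are rejected
/// here rather than by signature matching. `to` only carries the target type.
pub fn custom_cast_return_field(args: &[ToyField]) -> Result<ToyField, String> {
    let [from, to] = args else {
        return Err(format!("custom_cast expects 2 arguments, got {}", args.len()));
    };
    match (from.extension_name(), from.storage.as_str(), to.extension_name()) {
        // Casting a type to itself is always allowed.
        _ if from == to => Ok(to.clone()),
        // Assumed rule for this sketch: plain text can be parsed into a UUID.
        (None, "Utf8", Some("arrow.uuid")) => Ok(to.clone()),
        (from_ext, storage, to_ext) => Err(format!(
            "cannot cast {storage} (extension: {from_ext:?}) to extension {to_ext:?}"
        )),
    }
}
```

In a real UDF the second argument would be a null scalar whose field carries the extension metadata, so the planner never needs a `DataType` that can express the target.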
   
   > It is not clear to me if variant should be "built in" or if it should be 
an add on (for example, add a variant feature and a datafusion-variant crate)
   
   It's definitely easier to hard-code a type, although I think DataFusion will 
be better for allowing injected behaviour for more than just variant. Variant 
is definitely shinier, but UUIDs and geometry have many of the same problems. 
I'll put out a link to the vctrs R package ( 
https://vctrs.r-lib.org/articles/s3-vector.html ) which is a truly exceptional 
example of decentralized custom typing that supports custom printing, math, 
casting, and coercion for parameterized and unparameterized array types in R. 
For DataFusion, mostly this would involve access to a registry of types in 
places that are currently stateless (it can also be a static global variable 
like in Arrow C++, although I think the SessionContext is a better home).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

