paleolimbot commented on issue #16116: URL: https://github.com/apache/datafusion/issues/16116#issuecomment-3073876353
Just listing a few specific places where I've had to integrate extension types outside of existing DataFusion mechanisms:

- The `Signature` (i.e., how do you use the Signature mechanism to match a UDF). One can also just use a kernel that matches anything and error in `return_field()` as well (and coerce the types yourself). If I'm remembering correctly, the internal logic to coerce existing types isn't exposed in a way that makes it easy to do that.
- Casting: the `Cast` struct uses `DataType` (and anything that uses arrow-rs' casting will too). One can work around this by defining a UDF (e.g., `custom_cast(x, to)`) where `to` is a null scalar of the appropriate type.
- Printing: the out-of-the-box output you'll get is probably not what should be printed when you query a Parquet file with a variant column. (Could/should re-use the cast to string?)
- CSV output (Could/should re-use the cast to string?)
- The Parquet reader outputting the extension field (I assume this is in the works/is a parquet crate issue?)
- Multiple Parquet files/unioning: out of the box, a shredded and an unshredded variant will probably fail to concatenate because of the differing layout (and two shredded variants probably will too).
- SQL parsing/unparsing (mostly expressing the type name)
- Use of statistics to do pruning. I think DataFusion automatically disables Parquet pruning when it sees a column reference to a nested type like a struct (and its notion of a `Column` doesn't support nesting, and its notion of Statistics is somewhat limited), so there are possibly a few battles on this one.
- Probably more!

> It is not clear to me if variant should be "built in" or if it should be an add on (for example, add a variant feature and a datafusion-variant crate)

It's definitely easier to hard-code a type, although I think DataFusion will be better for allowing injected behaviour for more than just variant. Variant is definitely shinier, but UUIDs and geometry have many of the same problems.
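To make the "match anything and error in `return_field()`" workaround from the list above a bit more concrete, here is a rough standard-library-only sketch. It is not DataFusion or arrow-rs code: the `Field` struct and `uuid_to_string_return_field` are hypothetical stand-ins. The only things taken from Arrow itself are the `ARROW:extension:name` metadata key and the canonical `arrow.uuid` extension name.

```rust
use std::collections::HashMap;

/// Metadata key Arrow uses to tag a field with an extension type name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Hypothetical, simplified stand-in for an Arrow field (not the arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    storage_type: String,
    metadata: HashMap<String, String>,
}

impl Field {
    fn new(name: &str, storage_type: &str) -> Self {
        Field {
            name: name.to_string(),
            storage_type: storage_type.to_string(),
            metadata: HashMap::new(),
        }
    }

    /// Tag the field as carrying the given extension type.
    fn with_extension(mut self, extension_name: &str) -> Self {
        self.metadata
            .insert(EXTENSION_NAME_KEY.to_string(), extension_name.to_string());
        self
    }

    /// The extension name, if the field carries one.
    fn extension_name(&self) -> Option<&str> {
        self.metadata.get(EXTENSION_NAME_KEY).map(String::as_str)
    }
}

/// Hypothetical `return_field`-style check for a `uuid_to_string(x)` function.
/// The signature would accept any argument; the extension type is validated
/// here by hand because the storage type alone cannot identify it.
fn uuid_to_string_return_field(args: &[Field]) -> Result<Field, String> {
    match args {
        [arg] if arg.extension_name() == Some("arrow.uuid") => {
            Ok(Field::new("uuid_to_string", "Utf8"))
        }
        [arg] => Err(format!(
            "uuid_to_string: argument '{}' has storage {} and extension {:?}, expected arrow.uuid",
            arg.name,
            arg.storage_type,
            arg.extension_name()
        )),
        _ => Err(format!(
            "uuid_to_string: expected 1 argument, got {}",
            args.len()
        )),
    }
}
```

The point of the sketch is that the storage type (`FixedSizeBinary(16)` for a UUID) looks identical with or without the extension tag, so any matching has to look at field metadata rather than at the `DataType`.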
I'll put out a link to the vctrs R package (https://vctrs.r-lib.org/articles/s3-vector.html) which is a truly exceptional example of decentralized custom typing that supports custom printing, math, casting, and coercion for parameterized and unparameterized array types in R. Mostly this involves access to a registry of types in places that are currently stateless (can also be a static global variable like in Arrow C++ although I think the SessionContext is a better home).
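As a rough illustration of the registry idea, here is a standard-library-only sketch. None of it is DataFusion or arrow-rs API: `ExtensionBehaviour`, `ExtensionRegistry`, and `UuidBehaviour` are hypothetical names, and only printing is modelled (casting and coercion would presumably be further trait methods). The assumption is that something session-scoped would own the registry and consult it wherever values are rendered.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical trait: behaviour an extension type could inject.
trait ExtensionBehaviour: Send + Sync {
    /// Render one stored value for display or CSV output.
    fn format_value(&self, storage: &[u8]) -> String;
}

/// Example behaviour for a UUID stored as 16 raw bytes.
struct UuidBehaviour;

impl ExtensionBehaviour for UuidBehaviour {
    fn format_value(&self, storage: &[u8]) -> String {
        let hex: String = storage.iter().map(|b| format!("{:02x}", b)).collect();
        if hex.len() != 32 {
            return format!("<invalid uuid: {} bytes>", storage.len());
        }
        // Canonical 8-4-4-4-12 grouping.
        format!(
            "{}-{}-{}-{}-{}",
            &hex[0..8],
            &hex[8..12],
            &hex[12..16],
            &hex[16..20],
            &hex[20..32]
        )
    }
}

/// Hypothetical registry keyed by extension name; a session-like object
/// would hold one of these instead of relying on a global.
#[derive(Default)]
struct ExtensionRegistry {
    behaviours: HashMap<String, Arc<dyn ExtensionBehaviour>>,
}

impl ExtensionRegistry {
    fn register(&mut self, name: &str, behaviour: Arc<dyn ExtensionBehaviour>) {
        self.behaviours.insert(name.to_string(), behaviour);
    }

    /// Use the registered behaviour if there is one; otherwise fall back
    /// to a debug rendering of the raw storage bytes.
    fn format_value(&self, extension_name: Option<&str>, storage: &[u8]) -> String {
        match extension_name.and_then(|name| self.behaviours.get(name)) {
            Some(behaviour) => behaviour.format_value(storage),
            None => format!("{:?}", storage),
        }
    }
}
```

With `UuidBehaviour` registered under `arrow.uuid`, formatting a tagged value yields the hyphenated form, while untagged or unknown types fall back to the raw bytes. That fallback is the "out-of-the-box output" problem described above.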