wjones127 commented on issue #42069: URL: https://github.com/apache/arrow/issues/42069#issuecomment-2166130179
## An Arrow extension type?

In the near term, I think this would make a good Arrow extension type. This would be:

```
struct<
  metadata: dictionary<binary>,
  data: binary
>
```

The metadata will usually be a single binary shared across all rows, but there could be multiple. (Multiple might happen if two different batches are concatenated together, for example.) Either a dictionary or an REE-encoded array would be appropriate.

The data could be either binary, large binary, or binary view. Binary view isn’t widely supported right now, but could be very useful for this data type. This is because sub-objects can be sliced out of variants. From the spec [^1]:

> Another motivation for the representation is that (aside from metadata) each inner Variant value is contiguous and self-contained. For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.

[^1]: https://github.com/apache/spark/blob/master/common/variant/README.md

## Where could this be useful?

A few immediate places I think this extension type could be useful:

- Roundtrip variant Arrow ↔ Spark
- Spark Connect (and any ADBC connector to that) would benefit from this
- Extension type in PyArrow, roundtrip PySpark ↔ PyArrow
- DataFusion function library (I’m experimenting with that now)
  * There's been substantial interest in the DataFusion community for a way to handle semi-structured data efficiently.

## Extension type pitfalls

The main pitfall of using an extension type for this is that the storage type is meaningless to users. They need special libraries to interpret the bytes if the data is pulled into a system that doesn't understand the variant extension type.

In addition, most existing Arrow systems I've worked with don't have a way to customize how extension arrays are printed. I think this is something we should fix.
A reasonable workaround in the meantime is providing functions that convert these back to JSON strings for the purpose of printing.