wjones127 commented on issue #42069:
URL: https://github.com/apache/arrow/issues/42069#issuecomment-2166130179

   ## An Arrow extension type?
   
   In the near term, I think this would make a good Arrow extension type. Its 
storage type would be:
   
   ```
   struct<
     metadata: dictionary<binary>,
     data: binary
   >
   ```
   
   The metadata will usually be a single binary value shared across all rows, 
but there could be several. (That might happen if two different batches are 
concatenated together, for example.) Either a dictionary-encoded or a 
run-end-encoded (REE) array would be appropriate.
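
To make the trade-off concrete, here is a minimal pure-Python sketch of what the two encodings would store for the metadata column after two batches are concatenated. The helper names and the `metadata-A` / `metadata-B` byte strings are placeholders I made up for illustration; this models the idea only and does not use the Arrow API.

```python
from itertools import groupby


def dictionary_encode(values):
    """Return (dictionary, indices): each distinct value is stored once."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_end_encode(values):
    """Return (run_values, run_ends): run_ends[i] is the exclusive end row of run i."""
    run_values, run_ends, end = [], [], 0
    for value, group in groupby(values):
        end += sum(1 for _ in group)
        run_values.append(value)
        run_ends.append(end)
    return run_values, run_ends


# Placeholder metadata blobs; real Variant metadata bytes would go here.
batch_a = [b"metadata-A"] * 3
batch_b = [b"metadata-B"] * 2
combined = batch_a + batch_b  # what concatenating two batches would produce

dictionary, indices = dictionary_encode(combined)
run_values, run_ends = run_end_encode(combined)
```

Either way each metadata blob is stored once. Dictionary encoding keeps one small index per row, while REE keeps one entry per run, which stays tiny as long as rows with the same metadata are adjacent.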
   
   The data could be either binary, large binary, or binary view.
   
   Binary view isn’t widely supported right now, but could be very useful for 
this data type, because sub-objects can be sliced out of variants without 
copying their bytes. From the spec [^1]:
   
   > Another motivation for the representation is that (aside from metadata) 
each inner Variant value is contiguous and self-contained. For example, in a 
Variant containing an Array of Variant values, the representation of an inner 
Variant value, when paired with the metadata of the full variant, is itself a 
valid Variant.
   
   [^1]: https://github.com/apache/spark/blob/master/common/variant/README.md
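
As a rough analogy for why view-style storage fits here, the sketch below uses Python's `memoryview` to slice inner values out of one parent buffer without copying. The bytes and the `(offset, length)` pairs are placeholders and are not the real Variant encoding, and `memoryview` only stands in for the idea of a view; it is not Arrow's binary view layout.

```python
# Placeholder bytes: this is NOT the real Variant encoding, just three
# "inner values" laid out back to back in one parent buffer.
parent = bytearray(b"AAAA" + b"BBBBBB" + b"CC")

# Hypothetical (offset, length) pairs locating each inner value in `parent`.
inner_ranges = [(0, 4), (4, 6), (10, 2)]

buffer = memoryview(parent)
views = [buffer[offset:offset + length] for offset, length in inner_ranges]

# Each view references the parent's memory instead of owning a copy.
shares_parent = all(view.obj is parent for view in views)

# Because nothing was copied, a write to the parent is visible through the view.
parent[4] = ord("b")
second = bytes(views[1])
```

With plain binary or large binary storage, extracting an inner value as its own row would mean copying its bytes into a new data buffer; with views, the extracted row could keep pointing at the original bytes.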
   
   ## Where could this be useful?
   
   A few immediate places I think this extension type could be useful:
   
   - Roundtrip variant Arrow ↔ Spark
       - Spark Connect (and any ADBC connector to that) would benefit from this
   - Extension type in PyArrow, roundtrip PySpark ↔ PyArrow
   - DataFusion function library (I’m experimenting with that now)
       - There's been substantial interest in the DataFusion community for a 
way to handle semi-structured data efficiently.
   
   ## Extension type pitfalls
   
   The main pitfall of using an extension type for this is that the storage 
type is opaque to users. A system that doesn't understand the variant 
extension type sees only a struct of raw bytes, and users need special 
libraries to interpret them.
   
   In addition, most existing Arrow systems I've worked with don't have a way 
to customize how extension arrays are printed. I think this is something we 
should fix. A reasonable workaround in the meantime is providing functions that 
convert these back to JSON strings for the purpose of printing.
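
A minimal sketch of that workaround, assuming a variant library supplies a decoder: `variant_rows_to_json` and `decode` are hypothetical names, and `toy_decode` is a stand-in that pretends the data bytes are UTF-8 JSON, only so the example runs without a real Variant parser.

```python
import json


def variant_rows_to_json(rows, decode):
    """Render each (metadata, data) row as a JSON string for display.

    `decode` is a hypothetical callable, supplied by a variant library,
    that turns (metadata, data) bytes into plain Python objects.
    """
    return [
        "null" if row is None else json.dumps(decode(*row), sort_keys=True)
        for row in rows
    ]


def toy_decode(metadata, data):
    # Stand-in decoder for this demo only: it pretends the data bytes are
    # UTF-8 JSON. A real decoder would parse the Variant binary encoding.
    return json.loads(data.decode("utf-8"))


rows = [(b"", b'{"b": 1, "a": [true, null]}'), None, (b"", b'"text"')]
printed = variant_rows_to_json(rows, toy_decode)
```

A helper along these lines could be called just before display, so the stored column stays in its binary form and only the printed output is JSON.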

