[I] [Python] Arrow PyCapsule Protocol: standard way to get the schema of a "data" (array or stream) object? [arrow]

via GitHub Thu, 18 Jan 2024 06:34:05 -0800


jorisvandenbossche opened a new issue, #39689:
URL: https://github.com/apache/arrow/issues/39689


   Follow-up discussion on the Arrow PyCapsule Protocol semantics added in 
https://github.com/apache/arrow/pull/37797 (and overview issue promoting it: 
https://github.com/apache/arrow/issues/39195). Current docs: 
https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html
   
   This topic came up on the PR itself as well. I brought it up in 
https://github.com/apache/arrow/pull/37797#pullrequestreview-1642332563, and 
we then mostly discussed it (eventually removing `__arrow_c_schema__` 
from the array) in the thread at 
https://github.com/apache/arrow/pull/37797#discussion_r1337257050.  
   Rephrasing my question from the PR discussion:
   
   > Should "data" objects also expose their schema through adding a 
`__arrow_c_schema__`? (in addition to `__arrow_c_array/stream__`, on the same 
object)
   
   So in the merged implementation of the protocol in pyarrow itself, we 
cleanly separated this: the Array/ChunkedArray/RecordBatch/Table classes have 
`__arrow_c_array/stream__`, and the DataType/Field/Schema classes have 
`__arrow_c_schema__`.
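
A minimal sketch of how a consumer could tell these two groups apart purely by duck typing. The helper name `arrow_capsule_roles` and the two toy classes are hypothetical stand-ins for illustration, not pyarrow API; only the three dunder names come from the protocol.

```python
def arrow_capsule_roles(obj):
    """Return the set of Arrow PyCapsule roles that `obj` advertises."""
    dunders = {
        "schema": "__arrow_c_schema__",
        "array": "__arrow_c_array__",
        "stream": "__arrow_c_stream__",
    }
    return {role for role, name in dunders.items() if hasattr(obj, name)}


# Hypothetical stand-ins (not real pyarrow classes) mirroring the split above.
class SchemaLike:
    def __arrow_c_schema__(self):
        raise NotImplementedError("a real producer returns a PyCapsule here")


class ArrayLike:
    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("a real producer returns two PyCapsules here")


schema_roles = arrow_capsule_roles(SchemaLike())  # only the schema role
array_roles = arrow_capsule_roles(ArrayLike())    # only the array role
```

With the separation described above, a "data" object never reports the schema role, which is exactly the gap this issue is about.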
   
   But not all libraries have a clear concept of a "schema", or at least not as 
an accessible/dedicated Python object. 
   
   For example, take two cases for which I have an open PR to add the protocol. 
A pandas.DataFrame does have a `.dtypes` attribute, but that is not a custom 
object that can expose the schema protocol; it is just a plain Series with data 
types as the values (https://github.com/pandas-dev/pandas/pull/56587). The 
interchange protocol DataFrame object only exposes column names, and you need 
to access a column itself to get the dtype, which is then a plain Python tuple 
(so again not something to which the dunder could be added, and it is also not 
at the dataframe level) (https://github.com/data-apis/dataframe-api/pull/342). 
   
   Personally I think it would be useful to be able to inspect 
the schema of a "data" object before asking for the actual data. For pyarrow 
objects you could check the `.type` or `.schema` attributes, and then get 
`__arrow_c_schema__` from those, but that again puts something library-specific 
in the middle, which we want to avoid.
   
   Summarizing the different arguments from our earlier thread about having 
`__arrow_c_schema__` on an array/stream object:
   
   Pro:
   
   * Library agnostic way to get the schema of an Arrow(Array/Stream)Exportable 
object, before getting the actual data
   * Reasons you might want to do this:
     * To be able to inspect the schema without data conversions, because 
getting the data is not necessarily zero-copy (for libraries that are not 
exactly 1:1 aligned with the Arrow format)
     * If you want to pass a `requested_schema`, you first need to know the 
schema you would get, before you can create your desired schema to pass to 
`__arrow_c_array/stream__`
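
The `requested_schema` argument above can be sketched as follows, assuming the data object also exposes `__arrow_c_schema__` (which is the open question here, not current pyarrow behavior). `export_with_requested_schema`, `make_requested_schema` and `ToyProducer` are hypothetical names, and plain strings stand in for the real PyCapsules.

```python
def export_with_requested_schema(obj, make_requested_schema):
    """Inspect the schema first, then ask for the data with a requested schema."""
    if not hasattr(obj, "__arrow_c_schema__"):
        # No library-agnostic way to see the schema up front:
        # fall back to whatever the producer exports by default.
        return obj.__arrow_c_array__()
    current = obj.__arrow_c_schema__()
    # The consumer derives its desired schema from the one it would get.
    requested = make_requested_schema(current)
    return obj.__arrow_c_array__(requested_schema=requested)


class ToyProducer:
    """Hypothetical producer; strings stand in for real PyCapsules."""

    def __arrow_c_schema__(self):
        return "schema:int64"

    def __arrow_c_array__(self, requested_schema=None):
        schema = requested_schema or self.__arrow_c_schema__()
        return schema, "array-data"


result = export_with_requested_schema(
    ToyProducer(), lambda s: s.replace("int64", "int32")
)
```

Without the schema dunder on the data object, the consumer would have to go through a library-specific attribute to build `requested`.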
   
   Con:
   
   * Being able to pass an array or stream where a schema is expected is a bit 
too loose (Quote from Antoine); e.g. it is weird that passing an Array or 
RecordBatch to `pa.schema(..)` would work and return a schema (although 
sidenote from myself: _if_ we want, we can still disallow this, and only accept 
objects that _only_ have `__arrow_c_schema__` in `pa.schema(..)`)
   * Getting the schema of a stream may involve I/O and is a fallible 
operation, so I think that's more reason to separate them (Quote from David)
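
The sidenote on the first point (only accepting objects that _only_ have `__arrow_c_schema__`) could look roughly like this on the consumer side. This is a sketch with hypothetical names (`require_schema_only` and the two toy classes), not what `pa.schema(..)` does today.

```python
def require_schema_only(obj):
    """Return the schema capsule of `obj`, rejecting array/stream objects."""
    if not hasattr(obj, "__arrow_c_schema__"):
        raise TypeError("object does not export an Arrow schema")
    for name in ("__arrow_c_array__", "__arrow_c_stream__"):
        if hasattr(obj, name):
            raise TypeError(
                f"expected a schema-only object, but it also defines {name}"
            )
    return obj.__arrow_c_schema__()


# Hypothetical producers; strings stand in for real PyCapsules.
class OnlySchema:
    def __arrow_c_schema__(self):
        return "schema-capsule"


class DataWithSchema(OnlySchema):
    def __arrow_c_array__(self, requested_schema=None):
        return "schema-capsule", "array-capsule"


accepted = require_schema_only(OnlySchema())
try:
    require_schema_only(DataWithSchema())
    rejected = False
except TypeError:
    rejected = True
```

This keeps schema constructors strict while still letting data objects advertise their schema to consumers that ask for it explicitly.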
   
   I think it would be nice if we can have some guidance for projects about 
what the best practice is. 
   (Right now I am planning to add `__arrow_c_schema__` in the above-mentioned 
PRs because those projects don't have a "schema" object, but ideally I can 
follow a recommendation, so that consumer libraries know whether they can 
expect a schema to be available on such objects.)
   
   cc @wjones127 @pitrou @lidavidm 
   
   and also cc @kylebarron and @WillAyd as I know you both have been 
experimenting with the capsule protocol and might have some user experience 
with it
   


