jorisvandenbossche opened a new issue, #39689: URL: https://github.com/apache/arrow/issues/39689
Follow-up discussion on the Arrow PyCapsule Protocol semantics added in https://github.com/apache/arrow/pull/37797 (and the overview issue promoting it: https://github.com/apache/arrow/issues/39195). Current docs: https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html

This topic came up on the PR itself as well. I brought it up in https://github.com/apache/arrow/pull/37797#pullrequestreview-1642332563, and we then mostly discussed it (eventually removing `__arrow_c_schema__` from the array) in the thread at https://github.com/apache/arrow/pull/37797#discussion_r1337257050.

Rephrasing my question from the PR discussion:

> Should "data" objects also expose their schema through adding a `__arrow_c_schema__`? (in addition to `__arrow_c_array/stream__`, on the same object)

So in the merged implementation of the protocol in pyarrow itself, we cleanly separated this: the Array/ChunkedArray/RecordBatch/Table classes have `__arrow_c_array/stream__`, and the DataType/Field/Schema classes have `__arrow_c_schema__`.

But not all libraries have a clear concept of a "schema", or at least not as an accessible/dedicated Python object. For example, in the two cases for which I have an open PR to add the protocol: a pandas.DataFrame does have a `.dtypes` attribute, but that's not a custom object that can expose the schema protocol (it's just a plain Series with data types as the values) (https://github.com/pandas-dev/pandas/pull/56587); and the interchange protocol DataFrame object only exposes column names, and you need to access a column itself to get the dtype, which is then a plain Python tuple (so again not something to which the dunder could be added, and it is also not at the dataframe level) (https://github.com/data-apis/dataframe-api/pull/342).

Personally I think it would be useful to be able to inspect the schema of a "data" object before asking for the actual data.
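
To make the question concrete, here is a minimal consumer-side sketch of what "inspect the schema before asking for the data" could look like. It is only an illustration under stated assumptions: `DataOnly`, `DataWithSchema` and `peek_schema` are hypothetical names, and plain strings stand in for real PyCapsules so that the snippet runs without any Arrow library. Only the dunder names and their signatures are taken from the protocol.

```python
class DataOnly:
    """Hypothetical producer that only exposes the data dunder."""

    def __init__(self):
        # Counts how often the (possibly costly, non-zero-copy) export ran.
        self.exports = 0

    def __arrow_c_array__(self, requested_schema=None):
        self.exports += 1
        # Stand-ins for the (schema capsule, array capsule) pair.
        return ("schema-capsule", "array-capsule")


class DataWithSchema(DataOnly):
    """Hypothetical producer that also exposes its schema on the data object."""

    def __arrow_c_schema__(self):
        # Stand-in for a schema capsule; no data export is needed here.
        return "schema-capsule"


def peek_schema(obj):
    """Return (schema_capsule, exported_data) for an array-exportable object.

    exported_data is True when the data had to be exported just to learn
    the schema. Stream-only objects are not handled in this sketch.
    """
    if hasattr(obj, "__arrow_c_schema__"):
        return obj.__arrow_c_schema__(), False
    if hasattr(obj, "__arrow_c_array__"):
        schema_capsule, _array_capsule = obj.__arrow_c_array__()
        return schema_capsule, True
    raise TypeError("object does not expose __arrow_c_schema__ or __arrow_c_array__")


plain = DataOnly()
rich = DataWithSchema()
print(peek_schema(plain), plain.exports)
print(peek_schema(rich), rich.exports)
```

With `DataOnly`, the consumer only learns the schema as a side effect of exporting the data. With `DataWithSchema`, it can look at the schema first and only then decide whether (and how) to call `__arrow_c_array__`.
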
For pyarrow objects you could check the `.type` or `.schema` attributes, and then get `__arrow_c_schema__` from those, but that again puts something library-specific in the middle, which we want to avoid.

Summarizing the different arguments from our earlier thread about having `__arrow_c_schema__` on an array/stream object:

Pro:

* Library-agnostic way to get the schema of an Arrow(Array/Stream)Exportable object before getting the actual data
* Reasons you might want to do this:
  * To be able to inspect the schema without data conversions, because getting the data is not necessarily zero-copy (for libraries that are not exactly 1:1 aligned with the Arrow format)
  * If you want to pass a `requested_schema`, you first need to know the schema you would get, before you can create your desired schema to pass to `__arrow_c_array/stream__`

Con:

* Being able to pass an array or stream where a schema is expected is a bit too loose (quote from Antoine); e.g. it is weird that passing an Array or RecordBatch to `pa.schema(..)` would work and return a schema (although, as a side note from me: _if_ we want, we can still disallow this, and only accept objects that _only_ have `__arrow_c_schema__` in `pa.schema(..)`)
* Getting the schema of a stream may involve I/O and is a fallible operation, so I think that's more reason to separate them (quote from David)

I think it would be nice if we could give projects some guidance about what the best practice is. (Right now I was planning to add `__arrow_c_schema__` in the above-mentioned PRs because those projects don't have a "schema" object, but ideally I can follow a recommendation, so that consumer libraries can base their usage on a clear expectation of whether a schema is available or not.)

cc @wjones127 @pitrou @lidavidm and also cc @kylebarron and @WillAyd as I know you both have been experimenting with the capsule protocol and might have some user experience with it

--
This is an automated message from the Apache Git Service.
