Arrow arrays don't have metadata, so if you want to pass metadata around you should at least add a hook for columns as well.
Regards

Antoine.


On 09/05/2019 at 18:10, Joris Van den Bossche wrote:
> An additional question might be at which "level" to provide such a hook to
> third-party packages: I proposed it for Array, but what about chunked arrays,
> columns or tables? Maybe at least returning a chunked array should also be
> allowed.
>
> On Thu, 9 May 2019 at 18:06, Joris Van den Bossche <[email protected]> wrote:
>
>> The signature I had in mind is something like:
>>
>> def __arrow_array__(self, type: pyarrow.DataType = None) -> pyarrow.Array:
>>
>> where the function returns a pyarrow.Array, and takes an optional data
>> type (in case there are multiple ways to convert to a pyarrow Array, and
>> what can be passed by the user in the type argument in pyarrow.array(..) or
>> in a specified schema).
>>
>> But, the above is only for a one way path of custom array to Arrow array,
>> and not enough for a full roundtrip.
>>
>> For a full roundtrip in case of a pandas DataFrame, we will still need to
>> save information in metadata independently from __arrow_array__ and have
>> custom code in pyarrow to deal with pandas DataFrames (of which there is
>> already a lot). I mentioned this briefly in
>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>> / https://issues.apache.org/jira/browse/ARROW-2428, but one option could
>> be to save the name of the pandas extension dtype in the pandas_metadata of
>> an arrow Table (just as already happens for currently supported types), and
>> when exporting back to pandas with to_pandas pyarrow could check if this
>> extension dtype name is registered with pandas and if so, call a method
>> there to construct it.
>>
>> Joris
>>
>> On Thu, 9 May 2019 at 17:38, Antoine Pitrou <[email protected]> wrote:
>>
>>>
>>> Hi Joris,
>>>
>>> Do you have a signature for the __arrow_array__ method in mind?
>>>
>>> For example, let's say you want to roundtrip ExtensionArrays or other
>>> third-party data through Arrow. How do you preserve the required
>>> metadata?
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On 09/05/2019 at 13:29, Joris Van den Bossche wrote:
>>>> Hi all,
>>>>
>>>> I want to propose an interface to allow custom array objects in Python to
>>>> define how they should be converted to Arrow arrays (e.g. in
>>>> pyarrow.array(..)). I opened
>>>> https://issues.apache.org/jira/browse/ARROW-5271 for this.
>>>> This would be similar to the numpy __array__ protocol (so we could e.g. call
>>>> it __arrow_array__).
>>>> Feedback / discussion very welcome!
>>>>
>>>> I am coming to this discussion specifically from the point of view of
>>>> pandas ExtensionArrays (github issue for this:
>>>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556).
>>>> Such a protocol would, for example, make it possible that pandas users can
>>>> save DataFrames with ExtensionArrays (e.g. the nullable integers) to parquet,
>>>> without the need for pyarrow to know about all those possible different
>>>> extension arrays. This would also be useful for projects extending pandas
>>>> such as GeoPandas <https://github.com/geopandas/geopandas> and Fletcher
>>>> <https://github.com/xhochy/fletcher>.
>>>> But I suppose it could also be of interest more generally to other
>>>> array-like / pandas-like projects that want to interface with arrow.
>>>>
>>>> Sidenote: for the pandas case, I want to look at the full roundtrip, so also
>>>> the conversion back from an arrow Table to DataFrame. For that aspect there
>>>> is https://issues.apache.org/jira/browse/ARROW-2428, but this is much more
>>>> specific to pandas and its ExtensionArrays.
>>>>
>>>> Regards,
>>>> Joris
>>>>
>>>
>>
>
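
To make the two ideas in this thread concrete (the one-way __arrow_array__ hook, and saving the extension dtype name in metadata so the array can be rebuilt on the way back), here is a minimal pure-Python sketch. It deliberately does not import pyarrow or pandas. Only the shape of __arrow_array__(self, type=None) comes from the proposal above; ToyArrowArray, to_arrow, NullableIntArray, from_arrow, DTYPE_REGISTRY, export_column and import_column are hypothetical stand-ins invented for illustration, not existing APIs.

```python
import json
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ToyArrowArray:
    """Hypothetical stand-in for pyarrow.Array: values plus a type name."""
    values: List[Any]
    type: str


def to_arrow(obj: Any, type: Optional[str] = None) -> ToyArrowArray:
    """Stand-in for pyarrow.array(..): defer to the hook if the object has one."""
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        return hook(type=type)
    return ToyArrowArray(list(obj), type or "inferred")


class NullableIntArray:
    """Hypothetical third-party array: integer data plus a missing-value mask."""
    dtype_name = "toy_nullable_int"

    def __init__(self, data, mask):
        self.data = list(data)
        self.mask = list(mask)  # True marks a missing value

    def __arrow_array__(self, type=None):
        # One-way path: custom array -> (toy) Arrow array.
        values = [None if m else v for v, m in zip(self.data, self.mask)]
        return ToyArrowArray(values, type or "int64")

    @classmethod
    def from_arrow(cls, arr):
        # Hypothetical constructor used on the way back.
        mask = [v is None for v in arr.values]
        data = [0 if v is None else v for v in arr.values]
        return cls(data, mask)


# Assumed registry mapping extension dtype names to constructors.
DTYPE_REGISTRY = {NullableIntArray.dtype_name: NullableIntArray.from_arrow}


def export_column(name, obj):
    """Convert via the hook and record the dtype name in separate metadata."""
    arr = to_arrow(obj)
    meta = json.dumps(
        {"name": name, "extension_dtype": getattr(obj, "dtype_name", None)}
    )
    return arr, meta


def import_column(arr, meta):
    """Rebuild the custom array if its dtype name is registered."""
    dtype_name = json.loads(meta).get("extension_dtype")
    constructor = DTYPE_REGISTRY.get(dtype_name)
    return constructor(arr) if constructor else arr.values
```

Note that the metadata travels next to the array rather than inside it, which is the point raised at the top of this message: the hook alone covers the conversion, and the roundtrip needs a place at the column or table level to carry the dtype name.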
