[ https://issues.apache.org/jira/browse/ARROW-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487156#comment-17487156 ]
Joris Van den Bossche commented on ARROW-15471:
-----------------------------------------------

I don't know if it is exactly relevant here, but a few notes:

- In the C++ implementation, the field-level metadata (where the extension name and metadata are stored in serialized (IPC / C Data Interface schema) form) lives in the {{Field}} class. So an Array object itself cannot hold this metadata in C++. Thus, if you recreate an array by importing it from the C Data Interface, it is "expected" that the metadata is gone. That is also the reason why, if the array is part of a RecordBatch (which has a schema, with fields, potentially with metadata), this metadata is preserved (in the schema of the RecordBatch).
- The "registration" of extension types enables that, e.g. while deserializing an IPC schema message, if we encounter extension type metadata, we (meaning the C++ implementation) create an ExtensionArray with ExtensionType (and drop the field metadata). If the name of the extension type is _not_ registered, we keep the actual storage array/type and preserve the metadata in the field.
- As far as I know, ExtensionArray / ExtensionType is not yet exposed in the R bindings? I suppose that means you basically always get the storage array/type (I don't know what would happen if you actually had an extension type in a C++ RecordBatch and then accessed it from R, but I suppose this is quite difficult to achieve right now, since there are no extension types registered by default).
- The fact that R doesn't preserve field-level metadata in the schema seems like a separate issue?
(I mean that fixing it does not need full extension type support in R, but it is of course relevant here because extension types are not yet supported in R, which therefore falls back to keeping this information in field metadata)

> [R] ExtensionType support in R
> ------------------------------
>
> Key: ARROW-15471
> URL: https://issues.apache.org/jira/browse/ARROW-15471
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Dewey Dunnington
> Priority: Major
>
> In Python there is support for extension types that consists of a
> registration step that defines functions to handle metadata serialization and
> deserialization. In R, any extension name or metadata at the top level is
> currently obliterated on import. To implement geometry reading and writing to
> Parquet, IPC, and/or Feather, we will need to at the very least have the
> extension name and metadata preserved (in R), and at best provide a
> registration step to customize the behaviour of the resulting Array/DataType.
> Reprex for R:
> {code:R}
> # remotes::install_github("paleolimbot/narrow")
> library(narrow)
> carray <- as_narrow_array(1:5)
> carray$schema$metadata[["ARROW:extension:name"]] <- "extension name!"
> carray$schema$metadata[["ARROW:extension:metadata"]] <- "bananas"
> carray$schema$metadata[["something else"]] <- "more bananas"
> array <- from_narrow_array(carray, arrow::Array)
> carray2 <- as_narrow_array(array)
> carray2$schema$metadata[["ARROW:extension:name"]]
> #> NULL
> carray2$schema$metadata[["ARROW:extension:metadata"]]
> #> NULL
> carray2$schema$metadata[["something else"]]
> #> NULL
> {code}
> There is some discussion of that as a solution to ARROW-14378, including an
> example of how pandas implements the 'interval' extension type (example
> contributed by [~jorisvandenbossche]).
> For the Interval example, there are some different parts living in different
> places:
> - The Arrow Extension Type definition for pandas' interval type:
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/_arrow_utils.py#L88-L136
> - The __arrow_array__ implementation (doing the conversion to arrow):
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/interval.py#L1405-L1455
> - The __from_arrow__ implementation (conversion arrow -> pandas):
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/dtypes/dtypes.py#L1227-L1255

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
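
For readers who want the registration behaviour from the comment in a more concrete form, here is a rough sketch in plain Python. It is a toy model only, not Arrow's actual code or API: {{import_field}}, {{register_extension_type}}, {{REGISTRY}} and the tuple used to stand in for an extension type are all hypothetical names invented for illustration. It also assumes that only the two {{ARROW:extension:*}} keys are dropped from the field metadata when the name is registered, which may not match the C++ implementation exactly.

```python
# Toy model of the import-time decision described in the comment.
# NOT Arrow's real API: every name below is hypothetical.

EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical registry: extension name -> factory taking
# (storage_type, serialized_metadata) and returning a type description.
REGISTRY = {}


def register_extension_type(name, factory):
    """Record a factory for the given extension name."""
    REGISTRY[name] = factory


def import_field(storage_type, metadata):
    """Return (type, remaining_field_metadata) for an imported field."""
    name = metadata.get(EXT_NAME_KEY)
    if name in REGISTRY:
        # Registered: build the extension type and drop the two
        # extension keys from the field metadata (an assumption here).
        ext_type = REGISTRY[name](storage_type, metadata.get(EXT_META_KEY))
        remaining = {
            key: value
            for key, value in metadata.items()
            if key not in (EXT_NAME_KEY, EXT_META_KEY)
        }
        return ext_type, remaining
    # Not registered (or no extension name at all): keep the storage
    # type and leave the field metadata untouched.
    return storage_type, dict(metadata)


# Same keys as in the R reprex quoted above.
field_metadata = {
    EXT_NAME_KEY: "extension name!",
    EXT_META_KEY: "bananas",
    "something else": "more bananas",
}

# Before registration the storage type and all metadata survive.
unregistered = import_field("int32", field_metadata)

# After registration the same field resolves to the extension type.
register_extension_type(
    "extension name!",
    lambda storage, meta: ("extension", "extension name!", storage, meta),
)
registered = import_field("int32", field_metadata)
```

In this model the metadata always travels with the field rather than with the array, which mirrors the point that a bare Array has nowhere to keep it.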