[ 
https://issues.apache.org/jira/browse/ARROW-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487156#comment-17487156
 ] 

Joris Van den Bossche commented on ARROW-15471:
-----------------------------------------------

I don't know if it is exactly relevant here, but a few notes:

- In the C++ implementation, field-level metadata (which is where the extension 
name and metadata are stored in serialized form, i.e. in an IPC or C Data 
Interface schema) lives in the {{Field}} class, so an Array object itself 
cannot hold this metadata in C++. It is therefore "expected" that the metadata 
is gone when you recreate an array by importing it from the C Data Interface. 
That is also why the metadata is preserved when the array is part of a 
RecordBatch: a RecordBatch has a schema with fields, and those fields can 
carry metadata.
- "Registration" of extension types means that, e.g. while deserializing an 
IPC schema message, if we (meaning the C++ implementation) encounter extension 
type metadata for a registered name, we create an ExtensionArray with the 
corresponding ExtensionType and drop the field metadata. If the name of the 
extension type is _not_ registered, we keep the actual storage array/type and 
preserve the metadata in the field.
- As far as I know, ExtensionArray / ExtensionType is not yet exposed in the R 
bindings? I suppose that means you basically always get the storage array/type. 
(I don't know what would happen if you actually had an extension type in a C++ 
RecordBatch and then accessed it from R, but I suppose this is quite difficult 
to achieve right now, since no extension types are registered by default.)
- The fact that R doesn't preserve field-level metadata in the schema seems 
like a separate issue? I mean that fixing it does not require full extension 
type support in R. It is of course relevant here, because extension types are 
not yet supported in R, so this information falls back to being kept in the 
field metadata.

> [R] ExtensionType support in R
> ------------------------------
>
>                 Key: ARROW-15471
>                 URL: https://issues.apache.org/jira/browse/ARROW-15471
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Dewey Dunnington
>            Priority: Major
>
> In Python there is support for extension types that consists of a 
> registration step that defines functions to handle metadata serialization and 
> deserialization. In R, any extension name or metadata at the top level is 
> currently obliterated on import. To implement geometry reading and writing to 
> Parquet, IPC, and/or Feather, we will need to at the very least have the 
> extension name and metadata preserved (in R), and at best provide a 
> registration step to customize the behaviour of the resulting Array/DataType.
> Reprex for R:
> {code:R}
> # remotes::install_github("paleolimbot/narrow")
> library(narrow)
> carray <- as_narrow_array(1:5)
> carray$schema$metadata[["ARROW:extension:name"]] <- "extension name!"
> carray$schema$metadata[["ARROW:extension:metadata"]] <- "bananas"
> carray$schema$metadata[["something else"]] <- "more bananas"
> array <- from_narrow_array(carray, arrow::Array)
> carray2 <- as_narrow_array(array)
> carray2$schema$metadata[["ARROW:extension:name"]]
> #> NULL
> carray2$schema$metadata[["ARROW:extension:metadata"]]
> #> NULL
> carray2$schema$metadata[["something else"]]
> #> NULL
> {code}
> There is some discussion of that as a solution to ARROW-14378, including an 
> example of how pandas implements the 'interval' extension type (example 
> contributed by [~jorisvandenbossche]).
> For the Interval example, there are some different parts living in different 
> places:
> - The Arrow Extension Type definition for pandas' interval type: 
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/_arrow_utils.py#L88-L136
> - The __arrow_array__ implementation (conversion pandas -> arrow): 
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/interval.py#L1405-L1455
> - The __from_arrow__ implementation (conversion arrow -> pandas): 
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/dtypes/dtypes.py#L1227-L1255



