Dewey Dunnington created ARROW-15471:
----------------------------------------

             Summary: [R] ExtensionType support in R
                 Key: ARROW-15471
                 URL: https://issues.apache.org/jira/browse/ARROW-15471
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Dewey Dunnington


In Python, extension types are supported through a registration step that 
defines functions to handle metadata serialization and deserialization. In R, 
any extension name or metadata at the top level is currently dropped on 
import. To implement geometry reading and writing to Parquet, IPC, and/or 
Feather, we need at minimum to preserve the extension name and metadata in R, 
and ideally to provide a registration step that customizes the behaviour of 
the resulting Array/DataType.
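
As a rough illustration of the two levels described above (preserving the 
extension name and metadata versus registering a handler for them), here is a 
minimal, library-free sketch in Python. It is a toy model, not the pyarrow or 
R arrow API: register_extension, import_field, the "example.geometry" name 
and the CRS payload are all hypothetical names made up for this sketch. Only 
the two "ARROW:extension:*" keys come from the reprex in this issue.

```python
import json

# Field-metadata keys that mark an extension type (same keys as in the reprex).
EXT_NAME_KEY = "ARROW:extension:name"
EXT_METADATA_KEY = "ARROW:extension:metadata"

# Hypothetical registry: extension name -> function parsing the serialized metadata.
_registry = {}


def register_extension(name, deserialize):
    """Register a deserializer for one extension name (hypothetical helper)."""
    _registry[name] = deserialize


def import_field(metadata):
    """Import top-level field metadata without dropping anything.

    Unregistered extensions keep their raw serialized metadata; registered
    ones get it parsed by the registered function.
    """
    name = metadata.get(EXT_NAME_KEY)
    raw = metadata.get(EXT_METADATA_KEY)
    other = {
        key: value
        for key, value in metadata.items()
        if key not in (EXT_NAME_KEY, EXT_METADATA_KEY)
    }
    deserialize = _registry.get(name)
    if deserialize is not None and raw is not None:
        params = deserialize(raw)
    else:
        params = raw  # minimum behaviour: preserve the raw string
    return {"extension_name": name, "extension_metadata": params, "metadata": other}


# Assumed example payload: a geometry column carrying a CRS as JSON.
field = {
    EXT_NAME_KEY: "example.geometry",
    EXT_METADATA_KEY: '{"crs": "EPSG:4326"}',
    "something else": "more bananas",
}

unregistered = import_field(field)  # name and raw metadata survive
register_extension("example.geometry", json.loads)
registered = import_field(field)    # metadata is now parsed into a dict
```

In this sketch the unregistered case corresponds to the "at minimum" 
behaviour requested here, and the registered case to the optional 
registration step.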

Reprex for R:

{code:R}
# remotes::install_github("paleolimbot/narrow")
library(narrow)

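carray <- as_narrow_array(1:5)

# Attach an extension name, extension metadata, and an unrelated key
# to the top-level schema metadata
carray$schema$metadata[["ARROW:extension:name"]] <- "extension name!"
carray$schema$metadata[["ARROW:extension:metadata"]] <- "bananas"
carray$schema$metadata[["something else"]] <- "more bananas"

# Round trip through an arrow::Array
array <- from_narrow_array(carray, arrow::Array)
carray2 <- as_narrow_array(array)

# All three metadata entries are gone after the round trip
carray2$schema$metadata[["ARROW:extension:name"]]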
#> NULL
carray2$schema$metadata[["ARROW:extension:metadata"]]
#> NULL
carray2$schema$metadata[["something else"]]
#> NULL
{code}


Extension type support is also discussed as a possible solution to 
ARROW-14378, including an example of how pandas implements the 'interval' 
extension type (example contributed by [~jorisvandenbossche]).

For the Interval example, there are some different parts living in different 
places:

- The Arrow Extension Type definition for pandas' interval type: 
https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/_arrow_utils.py#L88-L136
- The __arrow_array__ implementation (conversion pandas -> arrow): 
https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/interval.py#L1405-L1455
- The __from_arrow__ implementation (conversion arrow -> pandas): 
https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/dtypes/dtypes.py#L1227-L1255



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
