paleolimbot opened a new issue, #66:
URL: https://github.com/apache/arrow-nanoarrow/issues/66

   After #65 we have built-in conversions for most Arrow types, including 
arbitrarily recursive nested struct and list types. There are a few rough edges 
remaining:
   
   - Converting streams with a fixed size (used in GDAL layer conversion, where 
the number of features is frequently known in advance) will fail for custom 
`to` targets (see reprex below)
   - Extension types just strip the extension type and convert the storage. 
This probably needs a registration step.
   - Converting streams with an unknown size currently falls back on a very 
slow "collect + rbind" approach. There should be a way to implement either 
growables or ALTREP + chunking to avoid two copies of the data and the slow 
`rbind()` call.
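   
   To make the first and last points concrete, here is a rough base R sketch 
(not nanoarrow code) that contrasts "collect + rbind" with preallocating the 
result when the size is known in advance. The `batches` list and the helpers 
`collect_rbind()` and `fill_preallocated()` are hypothetical stand-ins for 
already-converted batches, and the error message only imitates the one in the 
reprex below:
   
   ```r
   # Hypothetical stand-ins for batches that were already converted to R
   batches <- list(
     data.frame(x = letters, stringsAsFactors = FALSE),
     data.frame(x = LETTERS, stringsAsFactors = FALSE)
   )
   
   # "collect + rbind": every batch is materialized, then copied again by rbind()
   collect_rbind <- function(batches) {
     do.call(rbind, batches)
   }
   
   # Preallocate + fill: only possible when the total size is known up front
   fill_preallocated <- function(batches, size) {
     x <- character(size)
     offset <- 0L
     for (batch in batches) {
       n <- nrow(batch)
       # Write each batch directly into its slot of the output vector
       x[offset + seq_len(n)] <- batch$x
       offset <- offset + n
     }
     if (offset != size) {
       stop(sprintf("Expected to materialize %d values but materialized %d", size, offset))
     }
     data.frame(x = x, stringsAsFactors = FALSE)
   }
   
   a <- collect_rbind(batches)
   b <- fill_preallocated(batches, size = 52L)
   ```
   
   Both produce the same column, but the second version never holds a full 
set of intermediate batches alongside the combined result.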
   
   Reprex for extension types and S3 `convert_array()` methods:
   
   ``` r
   library(nanoarrow)
   
   # Extension types are not really supported
   ext_array <- as_nanoarrow_array(
     arrow::vctrs_extension_array(1:5)
   )
   convert_array(ext_array)
   #> Warning in convert_array.default(ext_array): Converting unknown extension
   #> arrow.r.vctrs{int32} as storage type
   #> [1] 1 2 3 4 5
   
   # Extensible targets are supported almost everywhere
   convert_array.some_custom_vctr <- function(array, to, ...) {
     vctrs::new_vctr(convert_array(array), class = "some_custom_vctr")
   }
   
   some_custom_vctr <- function() {
     vctrs::new_vctr(integer(), class = "some_custom_vctr")
   }
   
   array <- as_nanoarrow_array(1:10)
   struct_array <- as_nanoarrow_array(data.frame(x = 1:10))
   
   
   convert_array(array, some_custom_vctr())
   #> <some_custom_vctr[10]>
   #>  [1]  1  2  3  4  5  6  7  8  9 10
   convert_array(struct_array, tibble::tibble(x = some_custom_vctr()))
   #> # A tibble: 10 × 1
   #>             x
   #>    <sm_cstm_>
   #>  1          1
   #>  2          2
   #>  3          3
   #>  4          4
   #>  5          5
   #>  6          6
   #>  7          7
   #>  8          8
   #>  9          9
   #> 10         10
   convert_array_stream(
     as_nanoarrow_array_stream(data.frame(x = 1:10)),
     tibble::tibble(x = some_custom_vctr())
   )
   #> # A tibble: 10 × 1
   #>             x
   #>    <sm_cstm_>
   #>  1          1
   #>  2          2
   #>  3          3
   #>  4          4
   #>  5          5
   #>  6          6
   #>  7          7
   #>  8          8
   #>  9          9
   #> 10         10
   
   # ...except the version that materializes a stream to a known size
   convert_array_stream(
     as_nanoarrow_array_stream(data.frame(x = 1:10)),
     tibble::tibble(x = some_custom_vctr()),
     size = 10
   )
   #> Error in convert_array_stream(as_nanoarrow_array_stream(data.frame(x = 1:10)), : Expected to materialize 10 values in batch 1 but materialized 0
   ```
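   
   For the extension type point, one possible shape for a registration step 
is sketched below in plain R. Everything here is hypothetical: 
`register_extension_converter()`, `convert_extension()`, and the 
`example.celsius` extension name do not exist in nanoarrow, and the sketch 
operates on an already-converted storage vector rather than on a real array:
   
   ```r
   # Hypothetical registry mapping extension names to converter functions
   extension_registry <- new.env(parent = emptyenv())
   
   register_extension_converter <- function(name, converter) {
     stopifnot(is.character(name), length(name) == 1L, is.function(converter))
     assign(name, converter, envir = extension_registry)
     invisible(name)
   }
   
   # Look up a converter by extension name; fall back to the storage
   # (with a warning) when nothing is registered
   convert_extension <- function(name, storage) {
     converter <- get0(name, envir = extension_registry, inherits = FALSE)
     if (is.null(converter)) {
       warning(sprintf("Converting unknown extension %s as storage type", name))
       return(storage)
     }
     converter(storage)
   }
   
   register_extension_converter("example.celsius", function(storage) {
     structure(storage, class = "celsius")
   })
   
   known <- convert_extension("example.celsius", c(20.5, 21))
   unknown <- suppressWarnings(convert_extension("example.unregistered", 1:5))
   ```
   
   Unregistered extensions keep today's behaviour (warn and return the 
storage), so a registry like this would be additive.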
   
   Reprex for a stream conversion that would benefit from a better approach 
than `rbind()`:
   
   ``` r
   library(nanoarrow)
   
   reader <- arrow::RecordBatchReader$create(
     arrow::record_batch(x = letters),
     arrow::record_batch(x = LETTERS)
   )
   
   str(convert_array_stream(as_nanoarrow_array_stream(reader)))
   #> 'data.frame':    52 obs. of  1 variable:
   #>  $ x: chr  "a" "b" "c" "d" ...
   ```
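   
   As an illustration of the "growable" idea for streams of unknown size, 
here is a minimal base R sketch of a buffer that doubles its capacity as 
batches arrive. The helpers `new_growable()`, `growable_append()`, and 
`growable_finish()` are hypothetical and only handle character vectors; a 
real implementation would presumably live in C and cover every target type:
   
   ```r
   # Hypothetical growable character buffer; not part of nanoarrow
   new_growable <- function(capacity = 16L) {
     g <- new.env(parent = emptyenv())
     g$data <- character(capacity)
     g$size <- 0L
     g
   }
   
   growable_append <- function(g, values) {
     needed <- g$size + length(values)
     if (needed > length(g$data)) {
       # Double the capacity until the new values fit
       capacity <- max(length(g$data), 1L)
       while (capacity < needed) capacity <- capacity * 2L
       length(g$data) <- capacity
     }
     g$data[g$size + seq_along(values)] <- values
     g$size <- needed
     invisible(g)
   }
   
   # Trim the unused capacity when the stream is exhausted
   growable_finish <- function(g) {
     g$data[seq_len(g$size)]
   }
   
   g <- new_growable()
   for (batch in list(letters, LETTERS)) {
     growable_append(g, batch)
   }
   result <- growable_finish(g)
   ```
   
   Doubling keeps the number of reallocations logarithmic in the final size, 
at the cost of one trimming copy at the end.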
   