paleolimbot opened a new issue, #219:
URL: https://github.com/apache/arrow-nanoarrow/issues/219

   First noted in #66, it's fairly common to attempt to convert a stream with 
more than one batch to a data.frame. Currently this converts one chunk at a 
time and `rbind()`s or `c()`s everything together, which is slow and requires 
at least twice the memory.
   
   Related is the "fixed size" converter path, which does a "preallocate + 
fill"; however, this requires knowing the exact size before starting to pull 
batches, which is almost never the case. The first part could be solved by 
implementing the requisite copying functions to allow the pre-allocated vectors 
to be growable; however, that wouldn't allow the individual components to 
be ALTREP...everything would be fully materialized.
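   The growable version of "preallocate + fill" could look something like the 
sketch below: double the capacity whenever the next batch won't fit, so the 
total copying stays amortized O(n). The names (`GrowableInt32`, 
`growable_append()`) are illustrative, not nanoarrow API.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative growable buffer for one int32 column; not nanoarrow API. */
typedef struct {
  int32_t* data;
  int64_t size;
  int64_t capacity;
} GrowableInt32;

/* Ensure room for `additional` more elements, growing geometrically so
 * repeated appends copy each element O(1) times on average. */
static int growable_reserve(GrowableInt32* buf, int64_t additional) {
  int64_t needed = buf->size + additional;
  if (needed <= buf->capacity) return 0;
  int64_t new_capacity = buf->capacity > 0 ? buf->capacity : 64;
  while (new_capacity < needed) new_capacity *= 2;
  int32_t* new_data =
      (int32_t*)realloc(buf->data, new_capacity * sizeof(int32_t));
  if (new_data == NULL) return -1;
  buf->data = new_data;
  buf->capacity = new_capacity;
  return 0;
}

/* Append one batch worth of values. */
static int growable_append(GrowableInt32* buf, const int32_t* values,
                           int64_t n) {
  if (growable_reserve(buf, n) != 0) return -1;
  memcpy(buf->data + buf->size, values, n * sizeof(int32_t));
  buf->size += n;
  return 0;
}
```

The trade-off noted above still applies: the result is a single fully 
materialized vector, so nothing downstream can be ALTREP.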
   
   The arrow package handles this with a rather complicated implementation that 
has excellent type coverage: most chunked arrays can be wrapped in an ALTREP 
vector. Because we don't have Arrow C++ at our disposal, that approach is not 
practical here.
   
   Somewhere in the middle is implementing a generic ALTREP vector of a 
concatenation: the "data" would be a `list()` of type-checked vectors (which 
could themselves be ALTREP), and the ALTREP class would implement element 
access using something like the `ChunkResolver` in Arrow C++.
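   The element-access piece reduces to mapping a logical index into a (chunk, 
offset-in-chunk) pair, which a binary search over the cumulative chunk offsets 
handles in O(log n_chunks) per access. A minimal sketch in the spirit of Arrow 
C++'s `ChunkResolver` (the struct and function names here are illustrative):

```c
#include <stdint.h>

/* offsets[i] is the first logical index of chunk i;
 * offsets[n_chunks] is the total length. */
typedef struct {
  const int64_t* offsets;
  int64_t n_chunks;
} ChunkResolver;

typedef struct {
  int64_t chunk_index;
  int64_t index_in_chunk;
} ChunkLocation;

/* Binary search for the last chunk whose start offset is <= logical_index.
 * Empty chunks are skipped naturally because "last" wins ties. */
static ChunkLocation chunk_resolve(const ChunkResolver* r,
                                   int64_t logical_index) {
  int64_t lo = 0;
  int64_t hi = r->n_chunks;
  while (hi - lo > 1) {
    int64_t mid = (lo + hi) / 2;
    if (r->offsets[mid] <= logical_index) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  ChunkLocation loc = {lo, logical_index - r->offsets[lo]};
  return loc;
}
```

A real ALTREP class would keep the resolver (plus a cached "last chunk hit") 
in the vector's data slot and call it from `ELT`/`Elt` methods.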
   
   Independently of that, implementing ALTREP conversion for a single 
`ArrowArray` -- particularly for types that can share memory, like int32/double 
with no nulls -- would remove another copy. For types that can't share memory, 
lazily converting via the `ArrowArrayViewGet()` functions is also an option.
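   The reason nulls force a copy is that R encodes `NA_integer_` as a sentinel 
(`INT_MIN`) in the data itself, whereas Arrow keeps a separate validity bitmap. 
A sketch of the two paths, assuming nothing beyond the buffer layouts (the 
helper names are hypothetical, not nanoarrow API):

```c
#include <stdint.h>
#include <limits.h>
#include <stddef.h>

/* R's NA_integer_ sentinel value. */
#define R_NA_INTEGER INT_MIN

/* Arrow validity bitmaps are LSB-ordered: bit i of byte i/8. */
static int bit_is_set(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

/* Materializing path: required when nulls are present, because the NA
 * sentinel must be patched into the data buffer. When `validity` is NULL
 * (no nulls), an ALTREP INTSXP could instead point at the Arrow buffer
 * directly and skip this copy entirely. */
static void materialize_int32(const int32_t* values, const uint8_t* validity,
                              int64_t n, int32_t* out) {
  for (int64_t i = 0; i < n; i++) {
    out[i] = (validity == NULL || bit_is_set(validity, i)) ? values[i]
                                                           : R_NA_INTEGER;
  }
}
```

A lazy ALTREP variant would defer this loop to the first `DATAPTR` call and 
serve individual `ELT` accesses straight from the Arrow buffers in the 
meantime.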


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.