Hi Radu,

I'll read the proposals in more detail when I can and comment on them,
but this has always been an area of interest (see, e.g., [1]). The
intent of the "C++ data frames" project that we've discussed (and that
I continue to labor towards; the recent compute engine work is directly
in service of it) has always been to be able to express computations
on non-RAM-resident datasets [2].

As one initial high-level point of discussion, I think what you're
describing in these documents should probably be _new_ C++ classes and
_new_ virtual interfaces, not an evolution of the current arrow::Table
or arrow::Array/ChunkedArray classes. One practical path forward in
terms of discussing implementation issues would be to draft header
files proposing what these new class interfaces look like.
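
To make that concrete, a draft header along these lines could be a
starting point for discussion. This is only a rough sketch under
stated assumptions: the names VirtualChunk and LazyChunk are
hypothetical, std::vector<int64_t> stands in for real Arrow array
data, and none of this is existing Arrow API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical names throughout; none of these are existing Arrow classes.
namespace sketch {

// Stand-in for materialized, in-memory column data.
using Int64Chunk = std::vector<int64_t>;

// Abstract interface for a column chunk that may not be in memory yet.
class VirtualChunk {
 public:
  virtual ~VirtualChunk() = default;
  // Number of values, known from metadata without loading the data.
  virtual int64_t length() const = 0;
  // True once the data has been loaded into memory.
  virtual bool is_materialized() const = 0;
  // Loads the data on first call, then returns the cached copy.
  virtual std::shared_ptr<const Int64Chunk> Materialize() = 0;
};

// One possible implementation: defer to a caller-supplied loader (for
// example, a function that reads one Parquet column chunk) and cache
// whatever it returns.
class LazyChunk : public VirtualChunk {
 public:
  using Loader = std::function<Int64Chunk()>;

  LazyChunk(int64_t length, Loader loader)
      : length_(length), loader_(std::move(loader)) {
    assert(length_ >= 0);
  }

  int64_t length() const override { return length_; }

  bool is_materialized() const override {
    std::lock_guard<std::mutex> lock(mutex_);
    return data_ != nullptr;
  }

  std::shared_ptr<const Int64Chunk> Materialize() override {
    // The lock ensures the loader runs at most once across threads.
    std::lock_guard<std::mutex> lock(mutex_);
    if (!data_) {
      data_ = std::make_shared<Int64Chunk>(loader_());
    }
    return data_;
  }

 private:
  int64_t length_;
  Loader loader_;
  mutable std::mutex mutex_;
  std::shared_ptr<const Int64Chunk> data_;
};

}  // namespace sketch
```

Keeping this behind its own abstract interface, rather than changing
arrow::Array, means existing code that assumes in-memory data is
unaffected, and the loading and caching policy can be debated
separately from the interface itself.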

- Wes

[1]: https://issues.apache.org/jira/browse/ARROW-1329
[2]: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h

On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
<radukay...@yahoo.com.invalid> wrote:
>
> Hi folks,
> While I’ve been communicating with some members of this group in the past, 
> this is my first official post so please excuse/correct/guide me as needed.
>
> Logistics first:
> I put most of the content of my proposals in google doc, but if more 
> appropriate, we can keep the conversation going by email.
> Also the two proposals are pretty independent, so if needed we can break it 
> into two separate email threads, but for now I wanted to keep the spam low
>
> Actual proposals:
> Virtual Array - The idea is to be able to handle arrow Tables where some of 
> the column data is not (yet) available in memory. For example a Table can map 
> to a parquet file, create VirtualArrays for each column chunk and only read 
> the actual content if and when the Array is touched.
> Virtualize arrow Table 
> <https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing>
> Random Access - I find that “application state” for most large scale systems 
> is compatible with low level vectorized arrow representation and I propose a 
> number of API expansions that would enable thread safe data mutation and 
> efficient random access.
> Arrow arrays random access 
> <https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing>
> Please let me know what you think and what is the best course of action 
> moving forward.
> Thank you
> Radu
