Hi Radu,

I'll read the proposals in more detail when I can and comment, but this has always been a topic of interest (see, e.g., [1]). The intent of the "C++ data frames" project that we've discussed (and that I continue to work toward; the recent compute engine work is directly in service of it) has always been to be able to express computations on non-RAM-resident datasets [2].

As one initial high-level point of discussion, I think what you're describing in these documents should probably be _new_ C++ classes and _new_ virtual interfaces, not an evolution of the current arrow::Table or arrow::Array/ChunkedArray classes. One practical path forward for discussing implementation issues would be to draft header files proposing what these new class interfaces look like.

- Wes

[1]: https://issues.apache.org/jira/browse/ARROW-1329
[2]: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h

On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu <radukay...@yahoo.com.invalid> wrote:
>
> Hi folks,
> While I’ve been communicating with some members of this group in the past,
> this is my first official post so please excuse/correct/guide me as needed.
>
> Logistics first:
> I put most of the content of my proposals in Google Docs, but if more
> appropriate, we can keep the conversation going by email.
> Also the two proposals are pretty independent, so if needed we can break it
> into two separate email threads, but for now I wanted to keep the spam low.
>
> Actual proposals:
> Virtual Array - The idea is to be able to handle Arrow Tables where some of
> the column data is not (yet) available in memory. For example, a Table can map
> to a Parquet file, create VirtualArrays for each column chunk and only read
> the actual content if and when the Array is touched.
> Virtualize arrow Table
> <https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing>
> Random Access - I find that “application state” for most large-scale systems
> is compatible with the low-level vectorized Arrow representation, and I propose a
> number of API expansions that would enable thread-safe data mutation and
> efficient random access.
> Arrow arrays random access
> <https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing>
> Please let me know what you think and what is the best course of action
> moving forward.
> Thank you
> Radu
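
To make the "draft header files" suggestion in the reply concrete, here is a rough sketch of what a new virtual interface for a lazily loaded column chunk could look like. This is an illustration only: `VirtualColumn`, `LazyInt64Column` and the `Loader` callback are hypothetical names, not existing Arrow classes and not taken from the linked proposals. A plain `std::vector<std::int64_t>` stands in for an Arrow buffer, and the callback stands in for whatever would actually read a Parquet column chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical interface (not part of Arrow): a column chunk whose values
// may not be resident in memory yet.
class VirtualColumn {
 public:
  virtual ~VirtualColumn() = default;
  // Number of values; assumed to be known from metadata without loading data.
  virtual std::size_t length() const = 0;
  // True once the values have been materialized in memory.
  virtual bool is_loaded() const = 0;
  // Returns the values, materializing them on first access.
  virtual const std::vector<std::int64_t>& values() = 0;
};

// Hypothetical implementation that defers the read to a user-supplied callback.
class LazyInt64Column : public VirtualColumn {
 public:
  using Loader = std::function<std::vector<std::int64_t>()>;

  LazyInt64Column(std::size_t length, Loader loader)
      : length_(length), loader_(std::move(loader)) {
    assert(loader_ && "a loader callback is required");
  }

  std::size_t length() const override { return length_; }

  bool is_loaded() const override {
    std::lock_guard<std::mutex> lock(mutex_);
    return loaded_;
  }

  const std::vector<std::int64_t>& values() override {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!loaded_) {
      data_ = loader_();  // the only place the (possibly expensive) read happens
      loaded_ = true;
    }
    return data_;
  }

 private:
  std::size_t length_;
  Loader loader_;
  mutable std::mutex mutex_;  // guards loaded_ and data_
  bool loaded_ = false;
  std::vector<std::int64_t> data_;
};
```

In this sketch a caller can query `length()` without triggering any I/O; the loader runs at most once, on the first call to `values()`. Keeping this behind a separate interface, rather than changing arrow::Array itself, follows the point made in the reply about introducing new classes.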