Wes & crew,

Congratulations and thank you for the successful 1.0 rollout; it is certainly making a huge difference for my day job!

Is now a good time to revive the conversation below (and https://github.com/apache/arrow/pull/7548)?

I have also gone ahead and released a prototype that covers some of the more hand-wavy parts of my interface proposal (i.e. ways to compose arrays in a dataframe that control the balance between fragmentation and buffer copying). It is here: https://github.com/raduteo/framespaces/tree/master

It lacks documentation, but the basic data structures are robustly implemented and tested, so if we find merit in the original PR (https://github.com/apache/arrow/pull/7548), there should be a reasonable path for implementing most of it.

Thank you
Radu
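For anyone skimming the thread, the fragmentation-versus-copying balance mentioned above can be illustrated with a toy sketch. It is not taken from framespaces or from the PR; the class name `ChunkedColumn`, the `min_chunk_len` threshold, and the merge rule are all assumptions made up for illustration, and plain Python lists stand in for buffers.

```python
from typing import List


class ChunkedColumn:
    """Toy column stored as a list of chunks (plain Python lists).

    Hypothetical illustration only; not the framespaces or Arrow API.
    """

    def __init__(self, min_chunk_len: int) -> None:
        # Chunks shorter than this get merged into their neighbour (a copy)
        # instead of being kept as separate fragments.
        self.min_chunk_len = min_chunk_len
        self.chunks: List[List[int]] = []
        self.values_copied = 0  # running count of values copied by merges

    def append(self, chunk: List[int]) -> None:
        if not chunk:
            return
        last = self.chunks[-1] if self.chunks else None
        too_small = last is not None and (
            len(last) < self.min_chunk_len or len(chunk) < self.min_chunk_len
        )
        if too_small:
            # Merge: pay for a copy now to keep the chunk count low.
            self.chunks[-1] = last + chunk
            self.values_copied += len(last) + len(chunk)
        else:
            # Keep as-is: no copy, but one more fragment to iterate over.
            self.chunks.append(chunk)

    def to_list(self) -> List[int]:
        # Flatten all chunks in order.
        return [value for chunk in self.chunks for value in chunk]
```

A larger `min_chunk_len` means fewer, bigger chunks at the cost of more copying; a threshold of 1 never merges and so never copies, but every append adds a fragment.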
> On Jun 25, 2020, at 3:10 PM, Radu Teodorescu <radukay...@yahoo.com.INVALID> wrote:
>
> Understood and agreed.
> My proposal really addresses a number of mechanisms on layer 2 ("Virtual"
> tables) in your taxonomy (I can adjust interface names accordingly as part of
> the review process).
> One additional element I am proposing here is the ability to insert and
> modify rows in a vectorized fashion - they follow the same mechanics as
> “filter” (which is effectively row removal),
> and I think they are quite important as an efficiently supported construct
> (for things like data cleanup, data set updates etc.)
>
> I’m really looking forward to hearing more of your thoughts (as well as
> those of anybody else who is interested in this topic)
> Radu
>
>
>> On Jun 25, 2020, at 2:52 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> hi Radu,
>>
>> It's going to be challenging for me to review in detail until after
>> the 1.0.0 release is out, but in general I think there are 3 layers
>> that we need to be talking about:
>>
>> * Materialized in-memory tables
>> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
>>   exposed -- permitting column selection, iteration as for execution of
>>   query engine operators (e.g. projection, filter, join, aggregate), and
>>   random access
>> * "Data Frame API": a programming interface for expressing analytical
>>   operations on virtual tables. A data frame could be exported to
>>   materialized tables / record batches e.g. for writing to Parquet or
>>   IPC streams
>>
>> In principle the "Data Frame API" shouldn't need to know much about
>> the first two layers, instead working with high level primitives and
>> leaving the execution of those primitives to the layers below. Does
>> this make sense?
>>
>> I think we should be pretty strict about separation of concerns
>> between these three layers. I'll dig in in more detail sometime after
>> July 4.
>>
>> Thanks
>> Wes
>>
>>
>> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
>> <radukay...@yahoo.com.invalid> wrote:
>>>
>>> Here it is as a pull request:
>>> https://github.com/apache/arrow/pull/7548
>>>
>>> I hope this can be a starter for an active conversation diving into
>>> specifics, and I look forward to contributing more design and algorithm
>>> ideas as well as concrete code.
>>>
>>>> On Jun 17, 2020, at 6:11 PM, Neal Richardson <neal.p.richard...@gmail.com>
>>>> wrote:
>>>>
>>>> Maybe a draft pull request? If you put "WIP" in the pull request title, CI
>>>> won't run builds on it, so it's suitable for rough outlines and collecting
>>>> feedback.
>>>>
>>>> Neal
>>>>
>>>> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>>>> <radukay...@yahoo.com.invalid> wrote:
>>>>
>>>>> Thank you Wes!
>>>>> Yes, both proposals fit very nicely in your Data Frames vision; I see them
>>>>> as deep dives on some specifics:
>>>>> - the virtual array doc is more fluffy, and if you agree with the
>>>>>   general concept, the next logical move is probably to put out some
>>>>>   interfaces indeed
>>>>> - the random access doc goes into more detail and I am curious what you
>>>>>   think about some of the concepts
>>>>>
>>>>> I will follow up shortly with some interfaces - do you prefer references
>>>>> to a repo, inlining them in an email, or adding them as comments to your
>>>>> doc?
>>>>>
>>>>>
>>>>>> On Jun 17, 2020, at 4:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>
>>>>>> hi Radu,
>>>>>>
>>>>>> I'll read the proposals in more detail when I can and make comments,
>>>>>> but this has always been something of interest (see, e.g. [1]). The
>>>>>> intent with the "C++ data frames" project that we've discussed (and I
>>>>>> continue to labor towards, e.g.
>>>>>> recent compute engine work is directly
>>>>>> in service of this) has always been to be able to express computations
>>>>>> on non-RAM-resident datasets [2]
>>>>>>
>>>>>> As one initial high level point of discussion, I think what you're
>>>>>> describing in these documents should probably be _new_ C++ classes and
>>>>>> _new_ virtual interfaces, not an evolution of the current arrow::Table
>>>>>> or arrow::Array/ChunkedArray classes. One practical path forward in
>>>>>> terms of discussing implementation issues would be to draft header
>>>>>> files proposing what these new class interfaces look like.
>>>>>>
>>>>>> - Wes
>>>>>>
>>>>>> [1]: https://issues.apache.org/jira/browse/ARROW-1329
>>>>>> [2]: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>>>>>>
>>>>>> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
>>>>>> <radukay...@yahoo.com.invalid> wrote:
>>>>>>>
>>>>>>> Hi folks,
>>>>>>> While I’ve been communicating with some members of this group in the
>>>>>>> past, this is my first official post so please excuse/correct/guide me
>>>>>>> as needed.
>>>>>>>
>>>>>>> Logistics first:
>>>>>>> I put most of the content of my proposals in Google Docs, but if more
>>>>>>> appropriate, we can keep the conversation going by email.
>>>>>>> Also the two proposals are pretty independent, so if needed we can
>>>>>>> break them into two separate email threads, but for now I wanted to
>>>>>>> keep the spam low.
>>>>>>>
>>>>>>> Actual proposals:
>>>>>>> Virtual Array - The idea is to be able to handle arrow Tables where
>>>>>>> some of the column data is not (yet) available in memory. For example, a
>>>>>>> Table can map to a parquet file, create VirtualArrays for each column
>>>>>>> chunk and only read the actual content if and when the Array is touched.
>>>>>>> Virtualize arrow Table
>>>>>>> <https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing>
>>>>>>> Random Access - I find that “application state” for most large scale
>>>>>>> systems is compatible with low level vectorized arrow representation and I
>>>>>>> propose a number of API expansions that would enable thread safe data
>>>>>>> mutation and efficient random access.
>>>>>>> Arrow arrays random access
>>>>>>> <https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing>
>>>>>>> Please let me know what you think and what is the best course of action
>>>>>>> moving forward.
>>>>>>> Thank you
>>>>>>> Radu
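
As a rough illustration of the Virtual Array idea quoted above (read a column chunk only if and when it is touched), here is a minimal sketch. It is not the interface from the proposal docs or the PR: `VirtualArray`, `open_virtual_table`, and the dict that stands in for a parquet file are hypothetical names chosen for this example, and a real version would sit behind new C++ interfaces as Wes suggests.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


class VirtualArray:
    """Column chunk whose values are loaded on first element access.

    Hypothetical name and interface, for illustration only.
    """

    def __init__(self, length: int, loader: Callable[[], Sequence[int]]) -> None:
        self._length = length  # known up front, e.g. from file metadata
        self._loader = loader  # deferred read of the actual values
        self._data: Optional[Sequence[int]] = None

    @property
    def is_loaded(self) -> bool:
        return self._data is not None

    def __len__(self) -> int:
        # The length comes from metadata, so asking for it never triggers a read.
        return self._length

    def __getitem__(self, index: int) -> int:
        if self._data is None:
            data = self._loader()  # first touch: materialize the chunk
            if len(data) != self._length:
                raise ValueError("loaded chunk does not match declared length")
            self._data = data
        return self._data[index]


def open_virtual_table(
    file: Dict[str, List[List[int]]],
    reads: List[Tuple[str, int]],
) -> Dict[str, List[VirtualArray]]:
    """Build one VirtualArray per column chunk without reading any values.

    `file` is a stand-in for a columnar file (column name -> chunks);
    `reads` records which chunks were actually loaded.
    """

    def make_loader(column: str, chunk_index: int) -> Callable[[], Sequence[int]]:
        def load() -> Sequence[int]:
            reads.append((column, chunk_index))
            return file[column][chunk_index]

        return load

    return {
        name: [
            # len(chunk) plays the role of the row count stored in metadata.
            VirtualArray(len(chunk), make_loader(name, i))
            for i, chunk in enumerate(chunks)
        ]
        for name, chunks in file.items()
    }
```

Opening the table and querying chunk lengths performs no reads; indexing into one chunk loads only that chunk, and later accesses reuse the cached values. Thread safety, which the random access proposal cares about, is deliberately left out of this sketch.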