> I will have a closer look and comment most likely next week.
Thank you!

> Unfortunately, having code developed in external repositories increases the
> complexity of importing that code back into the Apache project. Not sure if
> you’re interested in preemptively following the project’s style guide (file
> naming, C++ code style, etc.) but that would also help.

I understand that challenge. My intent was to prove, to myself and anyone else, that there is a satisfying implementation that provides the semantics and the performance levels I am referring to in my proposals. It is a reference implementation, and certainly not something that can be dropped in directly in its current form (for example, I am leaning quite heavily on C++14/17 and a bit of C++20), but if the vision makes sense I would love to bring it into Arrow.

> On Wed, Aug 5, 2020 at 7:43 AM Radu Teodorescu <radukay...@yahoo.com.invalid>
> wrote:
>
>> Wes & crew,
>> Congratulations and thank you for the successful 1.0 rollout, it is
>> certainly making a huge difference for my day job!
>> Is it a good time now to revive the conversation below? (and
>> https://github.com/apache/arrow/pull/7548 )
>> I have also gone ahead and released a prototype that covers some of the
>> more hand wavy parts of my interface proposal (aka ways to compose arrays
>> in a dataframe that controls the balance between fragmentation and buffer
>> copying) - it is here: https://github.com/raduteo/framespaces/tree/master
>> and it lacks in documentation but the basic data structures are robustly
>> implemented and tested so if we find merits in the original PR:
>> https://github.com/apache/arrow/pull/7548 , there should be a reasonable
>> path for implementing most of it.
>>
>> Thank you
>> Radu
>>
>>
>>> On Jun 25, 2020, at 3:10 PM, Radu Teodorescu <radukay...@yahoo.com.INVALID> wrote:
>>>
>>> Understood and agreed
>>> My proposal really addresses a number of mechanisms on layer 2 ("Virtual" tables) in your taxonomy (I can adjust interface names accordingly as part of the review process).
>>> One additional element I am proposing here is the ability to insert and modify rows in a vectorized fashion - they follow the same mechanics as “filter” (which is effectively row removal)
>>> and I think they are quite important as an efficiently supported construct (for things like data cleanup, data set updates etc.)
>>>
>>> I’m really looking forward to hearing more of your thoughts (as well as anybody else’s who is interested in this topic)
>>> Radu
>>>
>>>
>>>> On Jun 25, 2020, at 2:52 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> hi Radu,
>>>>
>>>> It's going to be challenging for me to review in detail until after
>>>> the 1.0.0 release is out, but in general I think there are 3 layers
>>>> that we need to be talking about:
>>>>
>>>> * Materialized in-memory tables
>>>> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
>>>> exposed -- permitting column selection, iteration as for execution of
>>>> query engine operators (e.g. projection, filter, join, aggregate), and
>>>> random access
>>>> * "Data Frame API": a programming interface for expressing analytical
>>>> operations on virtual tables. A data frame could be exported to
>>>> materialized tables / record batches e.g. for writing to Parquet or
>>>> IPC streams
>>>>
>>>> In principle the "Data Frame API" shouldn't need to know much about
>>>> the first two layers, instead working with high level primitives and
>>>> leaving the execution of those primitives to the layers below. Does
>>>> this make sense?
>>>>
>>>> I think we should be pretty strict about separation of concerns
>>>> between these three layers.
>>>> I'll dig in in more detail sometime after
>>>> July 4.
>>>>
>>>> Thanks
>>>> Wes
>>>>
>>>>
>>>> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
>>>> <radukay...@yahoo.com.invalid> wrote:
>>>>>
>>>>> Here it is as a pull request:
>>>>> https://github.com/apache/arrow/pull/7548
>>>>>
>>>>> I hope this can be a starter for an active conversation diving into specifics, and I look forward to contributing more design and algorithm ideas as well as concrete code.
>>>>>
>>>>>> On Jun 17, 2020, at 6:11 PM, Neal Richardson <neal.p.richard...@gmail.com> wrote:
>>>>>>
>>>>>> Maybe a draft pull request? If you put "WIP" in the pull request title, CI
>>>>>> won't run builds on it, so it's suitable for rough outlines and collecting
>>>>>> feedback.
>>>>>>
>>>>>> Neal
>>>>>>
>>>>>> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>>>>>> <radukay...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>>> Thank you Wes!
>>>>>>> Yes, both proposals fit very nicely in your Data Frames vision, I see them
>>>>>>> as deep dives on some specifics:
>>>>>>> - the virtual array doc is more fluffy and probably if you agree with the
>>>>>>> general concept, the next logical move is to put out some interfaces indeed
>>>>>>> - the random access doc goes into more details and I am curious what you
>>>>>>> think about some of the concepts
>>>>>>>
>>>>>>> I will follow up shortly with some interfaces - do you prefer references
>>>>>>> to a repo, inline them in an email or add them as comments to your doc?
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 17, 2020, at 4:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> hi Radu,
>>>>>>>>
>>>>>>>> I'll read the proposals in more detail when I can and make comments,
>>>>>>>> but this has always been something of interest (see, e.g. [1]). The
>>>>>>>> intent with the "C++ data frames" project that we've discussed (and I
>>>>>>>> continue to labor towards, e.g.
>>>>>>>> recent compute engine work is directly
>>>>>>>> in service of this) has always been to be able to express computations
>>>>>>>> on non-RAM-resident datasets [2]
>>>>>>>>
>>>>>>>> As one initial high level point of discussion, I think what you're
>>>>>>>> describing in these documents should probably be _new_ C++ classes and
>>>>>>>> _new_ virtual interfaces, not an evolution of the current arrow::Table
>>>>>>>> or arrow::Array/ChunkedArray classes. One practical path forward in
>>>>>>>> terms of discussing implementation issues would be to draft header
>>>>>>>> files proposing what these new class interfaces look like.
>>>>>>>>
>>>>>>>> - Wes
>>>>>>>>
>>>>>>>> [1]: https://issues.apache.org/jira/browse/ARROW-1329
>>>>>>>> [2]: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>>>>>>>>
>>>>>>>> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
>>>>>>>> <radukay...@yahoo.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>> Hi folks,
>>>>>>>>> While I’ve been communicating with some members of this group in the past, this is my first official post so please excuse/correct/guide me as needed.
>>>>>>>>>
>>>>>>>>> Logistics first:
>>>>>>>>> I put most of the content of my proposals in google doc, but if more appropriate, we can keep the conversation going by email.
>>>>>>>>> Also the two proposals are pretty independent, so if needed we can break it into two separate email threads, but for now I wanted to keep the spam low
>>>>>>>>>
>>>>>>>>> Actual proposals:
>>>>>>>>> Virtual Array - The idea is to be able to handle arrow Tables where some of the column data is not (yet) available in memory. For example a Table can map to a parquet file, create VirtualArrays for each column chunk and only read the actual content if and when the Array is touched.
>>>>>>>>> Virtualize arrow Table <https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing>
>>>>>>>>> Random Access - I find that “application state” for most large scale systems is compatible with low level vectorized arrow representation and I propose a number of API expansions that would enable thread safe data mutation and efficient random access.
>>>>>>>>> Arrow arrays random access <https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing>
>>>>>>>>> Please let me know what you think and what is the best course of action moving forward.
>>>>>>>>> Thank you
>>>>>>>>> Radu
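
For anyone skimming the thread, the Virtual Array idea quoted above (only read a column chunk if and when it is touched) can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not code from the PR or from framespaces: `VirtualInt64Array`, `Loader`, `Value` and `is_materialized` are hypothetical names, none of them are Arrow APIs, and the loader callback simply stands in for whatever would read the column chunk from a parquet file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch of a lazily materialized column chunk.
// None of these names exist in Arrow; they are for illustration only.
class VirtualInt64Array {
 public:
  // Stand-in for "read this column chunk from storage".
  using Loader = std::function<std::vector<int64_t>()>;

  VirtualInt64Array(int64_t length, Loader loader)
      : length_(length), loader_(std::move(loader)) {}

  // The length is assumed to come from file metadata,
  // so asking for it does not trigger a read.
  int64_t length() const { return length_; }

  // True once the values have been read into memory.
  bool is_materialized() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return loaded_;
  }

  // The first access runs the loader; later accesses reuse the cached values.
  // Throws std::out_of_range for an invalid index.
  int64_t Value(int64_t i) const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!loaded_) {
      values_ = loader_();
      loaded_ = true;
    }
    return values_.at(static_cast<std::size_t>(i));
  }

 private:
  int64_t length_;
  Loader loader_;
  mutable std::mutex mutex_;
  mutable bool loaded_ = false;
  mutable std::vector<int64_t> values_;
};
```

With this shape, a table could hold one such object per column chunk: constructing it and calling `length()` cost nothing beyond metadata, and the loader runs on the first `Value` call and not before. A single mutex is the simplest way to keep the first touch safe across threads; a real design would likely want something cheaper on the hot path.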