Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

Radu Teodorescu Wed, 05 Aug 2020 06:16:39 -0700


> I will have a closer look and comment most likely next week.


Thank you!

> 
> Unfortunately, having code developed in external repositories increases the
> complexity of importing that code back into the Apache project. Not sure if
> you’re interested in preemptively following the project’s style guide (file
> naming, C++ code style, etc) but that would also help.

I understand that challenge. My intent was to prove, to myself and anyone else, 
that there is a satisfying implementation that provides the semantics and the 
performance levels I am referring to in my proposals. It is a reference 
implementation, and certainly not something that can be dropped in directly in 
its current form (for example, I am leaning quite heavily on C++14/17 and a bit 
of C++20), but if the vision makes sense I would love to bring that into Arrow.

> On Wed, Aug 5, 2020 at 7:43 AM Radu Teodorescu <[email protected]>
> wrote:
> 
>> Wes & crew,
>> Congratulations and thank you for the successful 1.0 rollout. It is
>> certainly making a huge difference for my day job!
>> Is it a good time now to revive the conversation below? (and
>> https://github.com/apache/arrow/pull/7548 )
>> I have also gone ahead and released a prototype that covers some of the
>> more hand-wavy parts of my interface proposal (i.e. ways to compose arrays
>> in a dataframe that control the balance between fragmentation and buffer
>> copying). It is here: https://github.com/raduteo/framespaces/tree/master
>> It lacks documentation, but the basic data structures are robustly
>> implemented and tested, so if we find merit in the original PR
>> (https://github.com/apache/arrow/pull/7548), there should be a reasonable
>> path for implementing most of it.
>> 
>> Thank you
>> Radu
>> 
>> 
>>> On Jun 25, 2020, at 3:10 PM, Radu Teodorescu
>> <[email protected]> wrote:
>>> 
>>> Understood and agreed
>>> My proposal really addresses a number of mechanisms on layer 2 (
>> "Virtual" tables) in your taxonomy (I can adjust interface names
>> accordingly as part of the review process).
>>> One additional element I am proposing here is the ability to insert and
>>> modify rows in a vectorized fashion. These operations follow the same
>>> mechanics as “filter” (which is effectively row removal),
>>> and I think they are quite important as an efficiently supported
>>> construct (for things like data cleanup, data set updates, etc.).
>>> 
>>> I’m really looking forward to hearing more of your thoughts (as well as
>>> anybody else’s who is interested in this topic).
>>> Radu
>>> 
>>> 
>>>> On Jun 25, 2020, at 2:52 PM, Wes McKinney <[email protected]> wrote:
>>>> 
>>>> hi Radu,
>>>> 
>>>> It's going to be challenging for me to review in detail until after
>>>> the 1.0.0 release is out, but in general I think there are 3 layers
>>>> that we need to be talking about:
>>>> 
>>>> * Materialized in-memory tables
>>>> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
>>>> exposed -- permitting column selection, iteration as for execution of
>>>> query engine operators (e.g. projection, filter, join, aggregate), and
>>>> random access
>>>> * "Data Frame API": a programming interface for expressing analytical
>>>> operations on virtual tables. A data frame could be exported to
>>>> materialized tables / record batches e.g. for writing to Parquet or
>>>> IPC streams
>>>> 
>>>> In principle the "Data Frame API" shouldn't need to know much about
>>>> the first two layers, instead working with high level primitives and
>>>> leaving the execution of those primitives to the layers below. Does
>>>> this make sense?
>>>> 
>>>> I think we should be pretty strict about separation of concerns
>>>> between these three layers. I'll dig in in more detail sometime after
>>>> July 4.
>>>> 
>>>> Thanks
>>>> Wes
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
>>>> <[email protected]> wrote:
>>>>> 
>>>>> Here it is as a pull request:
>>>>> https://github.com/apache/arrow/pull/7548
>>>>> 
>>>>> I hope this can be a starter for an active conversation diving into
>>>>> specifics, and I look forward to contributing more design and algorithm
>>>>> ideas as well as concrete code.
>>>>> 
>>>>>> On Jun 17, 2020, at 6:11 PM, Neal Richardson <
>> [email protected]> wrote:
>>>>>> 
>>>>>> Maybe a draft pull request? If you put "WIP" in the pull request
>> title, CI
>>>>>> won't run builds on it, so it's suitable for rough outlines and
>> collecting
>>>>>> feedback.
>>>>>> 
>>>>>> Neal
>>>>>> 
>>>>>> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>>> Thank you Wes!
>>>>>>> Yes, both proposals fit very nicely in your Data Frames vision, I
>> see them
>>>>>>> as deep dives on some specifics:
>>>>>>> - the virtual array doc is fluffier, and probably, if you agree with
>>>>>>> the general concept, the next logical move is indeed to put out some
>>>>>>> interfaces
>>>>>>> - the random access doc goes into more details and I am curious what
>> you
>>>>>>> think about some of the concepts
>>>>>>> 
>>>>>>> I will follow up shortly with some interfaces. Do you prefer
>>>>>>> references to a repo, inlining them in an email, or adding them as
>>>>>>> comments to your doc?
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jun 17, 2020, at 4:26 PM, Wes McKinney <[email protected]>
>> wrote:
>>>>>>>> 
>>>>>>>> hi Radu,
>>>>>>>> 
>>>>>>>> I'll read the proposals in more detail when I can and make comments,
>>>>>>>> but this has always been something of interest (see, e.g. [1]). The
>>>>>>>> intent with the "C++ data frames" project that we've discussed (and
>> I
>>>>>>>> continue to labor towards, e.g. recent compute engine work is
>> directly
>>>>>>>> in service of this) has always been to be able to express
>> computations
>>>>>>>> on non-RAM-resident datasets [2]
>>>>>>>> 
>>>>>>>> As one initial high level point of discussion, I think what you're
>>>>>>>> describing in these documents should probably be _new_ C++ classes
>> and
>>>>>>>> _new_ virtual interfaces, not an evolution of the current
>> arrow::Table
>>>>>>>> or arrow::Array/ChunkedArray classes. One practical path forward in
>>>>>>>> terms of discussing implementation issues would be to draft header
>>>>>>>> files proposing what these new class interfaces look like.
>>>>>>>> 
>>>>>>>> - Wes
>>>>>>>> 
>>>>>>>> [1]: https://issues.apache.org/jira/browse/ARROW-1329
>>>>>>>> [2]:
>>>>>>> 
>> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>>>>>>>> 
>>>>>>>> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Hi folks,
>>>>>>>>> While I’ve been communicating with some members of this group in
>> the
>>>>>>> past, this is my first official post so please excuse/correct/guide
>> me as
>>>>>>> needed.
>>>>>>>>> 
>>>>>>>>> Logistics first:
>>>>>>>>> I put most of the content of my proposals in Google Docs, but if
>>>>>>>>> more appropriate, we can keep the conversation going by email.
>>>>>>>>> Also, the two proposals are pretty independent, so if needed we can
>>>>>>>>> break this into two separate email threads, but for now I wanted to
>>>>>>>>> keep the spam low.
>>>>>>>>> 
>>>>>>>>> Actual proposals:
>>>>>>>>> Virtual Array - The idea is to be able to handle arrow Tables where
>>>>>>> some of the column data is not (yet) available in memory. For
>> example a
>>>>>>> Table can map to a parquet file, create VirtualArrays for each
>> column chunk
>>>>>>> and only read the actual content if and when the Array is touched.
>>>>>>>>> Virtualize arrow Table <
>>>>>>> 
>> https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing
>>>>>>>> 
>>>>>>>>> Random Access - I find that “application state” for most large
>> scale
>>>>>>> systems is compatible with low level vectorized arrow representation
>> and I
>>>>>>> propose a number of API expansions that would enable thread safe data
>>>>>>> mutation and efficient random access.
>>>>>>>>> Arrow arrays random access <
>>>>>>> 
>> https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing
>>>>>>>> 
>>>>>>>>> Please let me know what you think and what is the best course of
>> action
>>>>>>> moving forward.
>>>>>>>>> Thank you
>>>>>>>>> Radu
>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>> 
>>
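
For readers following the thread, the "Virtual Array" idea from the original message (a column whose data is read only if and when the array is touched) can be illustrated with a small sketch. This is only an illustration under stated assumptions: `VirtualInt64Array`, its `Loader` callback, and `DemoLoadCount` are hypothetical names invented here, and they come neither from Arrow nor from the proposal documents or the prototype repository. The loader stands in for whatever would read one Parquet column chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch: these names are illustrative and are not part of
// Arrow or of the proposal documents.
class VirtualInt64Array {
 public:
  using Loader = std::function<std::vector<std::int64_t>()>;

  // `length` stands in for metadata (e.g. a row count) that is assumed to be
  // available without reading the column data itself.
  VirtualInt64Array(std::size_t length, Loader loader)
      : length_(length), loader_(std::move(loader)) {}

  // Answered from metadata; does not trigger a load.
  std::size_t length() const { return length_; }

  bool is_loaded() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return loaded_;
  }

  // The first call runs the loader; later calls reuse the cached values.
  // Throws std::out_of_range if `i` is past the loaded data.
  std::int64_t Value(std::size_t i) const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!loaded_) {
      values_ = loader_();
      loaded_ = true;
    }
    return values_.at(i);
  }

 private:
  std::size_t length_;
  Loader loader_;
  mutable std::mutex mutex_;
  mutable bool loaded_ = false;
  mutable std::vector<std::int64_t> values_;
};

// Returns how many times the loader ran after two reads of the same column.
inline int DemoLoadCount() {
  int loads = 0;
  VirtualInt64Array column(3, [&loads] {
    ++loads;  // stands in for an expensive read of one column chunk
    return std::vector<std::int64_t>{10, 20, 30};
  });
  assert(!column.is_loaded());  // nothing is read at construction time
  column.Value(0);
  column.Value(2);
  return loads;
}
```

In this sketch, asking for `length()` never touches the data, while the first `Value()` call pays the full load cost and every later call is served from memory. A single mutex keeps the example short; a real design would likely want finer-grained loading and would need to decide how load errors surface to the caller.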
