Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

Radu Teodorescu Thu, 25 Jun 2020 12:11:42 -0700

Understood and agreed
My proposal really addresses a number of mechanisms on layer 2 ( "Virtual" 
tables) in your taxonomy (I can adjust interface names accordingly as part of 
the review process).
One additional element I am proposing here is the ability to insert and modify 
rows in a vectorized fashion. These operations follow the same mechanics as 
“filter” (which is effectively row removal), 
and I think they are quite important as an efficiently supported construct (for 
things like data cleanup, data set updates, etc.).
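
To make the parallel between filter, modify and insert concrete, here is a minimal sketch in plain Python rather than Arrow. The function names (`filter_rows`, `modify_rows`, `insert_rows`) and their signatures are hypothetical, chosen only for illustration. They are not taken from the pull request or from any existing Arrow API. Each operation takes a whole column plus a vector describing the affected rows and returns a new column.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def filter_rows(column: Sequence[T], keep_mask: Sequence[bool]) -> List[T]:
    """Row removal: keep only the rows whose mask entry is True."""
    return [value for value, keep in zip(column, keep_mask) if keep]


def modify_rows(column: Sequence[T], indices: Sequence[int],
                new_values: Sequence[T]) -> List[T]:
    """Row update: overwrite column[indices[k]] with new_values[k]."""
    out = list(column)
    for index, value in zip(indices, new_values):
        out[index] = value
    return out


def insert_rows(column: Sequence[T], positions: Sequence[int],
                new_values: Sequence[T]) -> List[T]:
    """Row insertion: place new_values[k] before original row positions[k].

    Assumes positions is sorted ascending and every entry lies in
    [0, len(column)]; an entry equal to len(column) appends at the end.
    """
    out: List[T] = []
    k = 0
    for index, value in enumerate(column):
        # Emit every new row scheduled to land before this original row.
        while k < len(positions) and positions[k] == index:
            out.append(new_values[k])
            k += 1
        out.append(value)
    out.extend(new_values[k:])  # remaining rows are appended at the end
    return out
```

For example, `insert_rows([10, 20, 30], [0, 2, 3], [5, 25, 35])` yields `[5, 10, 20, 25, 30, 35]`. All three operations are a single pass over the column driven by an index or mask vector.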


I’m really looking forward to hearing more of your thoughts (as well as anybody 
else’s who is interested in this topic).
Radu 


> On Jun 25, 2020, at 2:52 PM, Wes McKinney <[email protected]> wrote:
> 
> hi Radu,
> 
> It's going to be challenging for me to review in detail until after
> the 1.0.0 release is out, but in general I think there are 3 layers
> that we need to be talking about:
> 
> * Materialized in-memory tables
> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
> exposed -- permitting column selection, iteration as for execution of
> query engine operators (e.g. projection, filter, join, aggregate), and
> random access
> * "Data Frame API": a programming interface for expressing analytical
> operations on virtual tables. A data frame could be exported to
> materialized tables / record batches e.g. for writing to Parquet or
> IPC streams
> 
> In principle the "Data Frame API" shouldn't need to know much about
> the first two layers, instead working with high level primitives and
> leaving the execution of those primitives to the layers below. Does
> this make sense?
> 
> I think we should be pretty strict about separation of concerns
> between these three layers. I'll dig in in more detail sometime after
> July 4.
> 
> Thanks
> Wes
> 
> 
> 
> 
> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
> <[email protected]> wrote:
>> 
>> Here it is as a pull request:
>> https://github.com/apache/arrow/pull/7548
>> 
>> I hope this can be a starter for an active conversation diving into 
>> specifics, and I look forward to contributing more design and algorithm 
>> ideas as well as concrete code.
>> 
>>> On Jun 17, 2020, at 6:11 PM, Neal Richardson <[email protected]> 
>>> wrote:
>>> 
>>> Maybe a draft pull request? If you put "WIP" in the pull request title, CI
>>> won't run builds on it, so it's suitable for rough outlines and collecting
>>> feedback.
>>> 
>>> Neal
>>> 
>>> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>>> <[email protected]> wrote:
>>> 
>>>> Thank you Wes!
>>>> Yes, both proposals fit very nicely in your Data Frames vision, I see them
>>>> as deep dives on some specifics:
>>>> - the virtual array doc is fluffier, and if you agree with the general
>>>> concept, the next logical move is probably to put out some interfaces
>>>> - the random access doc goes into more details and I am curious what you
>>>> think about some of the concepts
>>>> 
>>>> I will follow up shortly with some interfaces - would you prefer references
>>>> to a repo, interfaces inlined in an email, or comments added to your doc?
>>>> 
>>>> 
>>>>> On Jun 17, 2020, at 4:26 PM, Wes McKinney <[email protected]> wrote:
>>>>> 
>>>>> hi Radu,
>>>>> 
>>>>> I'll read the proposals in more detail when I can and make comments,
>>>>> but this has always been something of interest (see, e.g. [1]). The
>>>>> intent with the "C++ data frames" project that we've discussed (and I
>>>>> continue to labor towards, e.g. recent compute engine work is directly
>>>>> in service of this) has always been to be able to express computations
>>>>> on non-RAM-resident datasets [2]
>>>>> 
>>>>> As one initial high level point of discussion, I think what you're
>>>>> describing in these documents should probably be _new_ C++ classes and
>>>>> _new_ virtual interfaces, not an evolution of the current arrow::Table
>>>>> or arrow::Array/ChunkedArray classes. One practical path forward in
>>>>> terms of discussing implementation issues would be to draft header
>>>>> files proposing what these new class interfaces look like.
>>>>> 
>>>>> - Wes
>>>>> 
>>>>> [1]: https://issues.apache.org/jira/browse/ARROW-1329
>>>>> [2]:
>>>> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>>>>> 
>>>>> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Hi folks,
>>>>>> While I’ve been communicating with some members of this group in the
>>>> past, this is my first official post so please excuse/correct/guide me as
>>>> needed.
>>>>>> 
>>>>>> Logistics first:
>>>>>> I put most of the content of my proposals in Google Docs, but if more
>>>> appropriate, we can keep the conversation going by email.
>>>>>> Also the two proposals are pretty independent, so if needed we can
>>>> break it into two separate email threads, but for now I wanted to keep the
>>>> spam low
>>>>>> 
>>>>>> Actual proposals:
>>>>>> Virtual Array - The idea is to be able to handle arrow Tables where
>>>> some of the column data is not (yet) available in memory. For example a
>>>> Table can map to a parquet file, create VirtualArrays for each column chunk
>>>> and only read the actual content if and when the Array is touched.
>>>>>> Virtualize arrow Table <
>>>> https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing
>>>>> 
>>>>>> Random Access - I find that “application state” for most large scale
>>>> systems is compatible with low level vectorized arrow representation and I
>>>> propose a number of API expansions that would enable thread safe data
>>>> mutation and efficient random access.
>>>>>> Arrow arrays random access <
>>>> https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing
>>>>> 
>>>>>> Please let me know what you think and what is the best course of action
>>>> moving forward.
>>>>>> Thank you
>>>>>> Radu
>>>> 
>>>> 
>>
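
The Virtual Array idea quoted above (a table mapped to a file, with column data read only if and when the array is touched) can be sketched roughly as follows. This is plain Python, not Arrow. The `VirtualArray` class, its `loader` callable and the `make_loader` helper are hypothetical names used only for illustration. A real implementation would read a Parquet column chunk where this sketch copies an in-memory list.

```python
from typing import Callable, Dict, List, Optional


class VirtualArray:
    """A column whose values are fetched on first access (hypothetical sketch)."""

    def __init__(self, loader: Callable[[], List[int]], length: int) -> None:
        self._loader = loader      # stand-in for "read this column chunk"
        self._length = length      # known from metadata, no data read needed
        self._data: Optional[List[int]] = None

    def __len__(self) -> int:
        return self._length

    @property
    def is_loaded(self) -> bool:
        return self._data is not None

    def values(self) -> List[int]:
        # Materialize lazily, then cache so the loader runs at most once.
        if self._data is None:
            self._data = self._loader()
        return self._data

    def __getitem__(self, index: int) -> int:
        return self.values()[index]


load_log: List[str] = []  # records which columns were actually read


def make_loader(name: str, data: List[int]) -> Callable[[], List[int]]:
    def load() -> List[int]:
        load_log.append(name)
        return list(data)
    return load


table: Dict[str, VirtualArray] = {
    "a": VirtualArray(make_loader("a", [1, 2, 3]), 3),
    "b": VirtualArray(make_loader("b", [4, 5, 6]), 3),
}

first = table["a"][0]  # touching column "a" triggers its load; "b" stays unread
```

After the last line only column `a` has been read. `len(table["b"])` still works because the length comes from metadata rather than from the data itself.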
