Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

Radu Teodorescu Wed, 05 Aug 2020 05:43:40 -0700

Wes & crew,
Congratulations and thank you for the successful 1.0 rollout; it is certainly
making a huge difference for my day job!
Is now a good time to revive the conversation below (and
https://github.com/apache/arrow/pull/7548)?
I have also gone ahead and released a prototype that covers some of the more
hand-wavy parts of my interface proposal (i.e., ways to compose arrays in a
dataframe that control the balance between fragmentation and buffer copying).
It is here: https://github.com/raduteo/framespaces/tree/master
It is light on documentation, but the basic data structures are robustly
implemented and tested, so if we find merit in the original PR
(https://github.com/apache/arrow/pull/7548), there should be a reasonable path
for implementing most of it.
 
Thank you
Radu


> On Jun 25, 2020, at 3:10 PM, Radu Teodorescu <radukay...@yahoo.com.INVALID> 
> wrote:
> 
> Understood and agreed.
> My proposal really addresses a number of mechanisms on layer 2 ("Virtual"
> tables) in your taxonomy (I can adjust interface names accordingly as part of
> the review process).
> One additional element I am proposing here is the ability to insert and
> modify rows in a vectorized fashion. These operations follow the same
> mechanics as “filter” (which is effectively row removal), and I think they
> are quite important as an efficiently supported construct (for things like
> data cleanup, data set updates, etc.).
> 
> I’m really looking forward to hearing more of your thoughts (as well as
> anybody else’s who is interested in this topic).
> Radu 
> 
> 
>> On Jun 25, 2020, at 2:52 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> 
>> hi Radu,
>> 
>> It's going to be challenging for me to review in detail until after
>> the 1.0.0 release is out, but in general I think there are 3 layers
>> that we need to be talking about:
>> 
>> * Materialized in-memory tables
>> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
>> exposed -- permitting column selection, iteration as for execution of
>> query engine operators (e.g. projection, filter, join, aggregate), and
>> random access
>> * "Data Frame API": a programming interface for expressing analytical
>> operations on virtual tables. A data frame could be exported to
>> materialized tables / record batches e.g. for writing to Parquet or
>> IPC streams
>> 
>> In principle the "Data Frame API" shouldn't need to know much about
>> the first two layers, instead working with high level primitives and
>> leaving the execution of those primitives to the layers below. Does
>> this make sense?
>> 
>> I think we should be pretty strict about separation of concerns
>> between these three layers. I'll dig in in more detail sometime after
>> July 4.
>> 
>> Thanks
>> Wes
>> 
>> 
>> 
>> 
>> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
>> <radukay...@yahoo.com.invalid> wrote:
>>> 
>>> Here it is as a pull request:
>>> https://github.com/apache/arrow/pull/7548 
>>> 
>>> I hope this can be a starter for an active conversation diving into
>>> specifics, and I look forward to contributing more design and algorithm
>>> ideas as well as concrete code.
>>> 
>>>> On Jun 17, 2020, at 6:11 PM, Neal Richardson <neal.p.richard...@gmail.com> 
>>>> wrote:
>>>> 
>>>> Maybe a draft pull request? If you put "WIP" in the pull request title, CI
>>>> won't run builds on it, so it's suitable for rough outlines and collecting
>>>> feedback.
>>>> 
>>>> Neal
>>>> 
>>>> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>>>> <radukay...@yahoo.com.invalid> wrote:
>>>> 
>>>>> Thank you Wes!
>>>>> Yes, both proposals fit very nicely into your Data Frames vision; I see
>>>>> them as deep dives on some specifics:
>>>>> - the virtual array doc is fluffier, and probably, if you agree with the
>>>>> general concept, the next logical move is indeed to put out some
>>>>> interfaces
>>>>> - the random access doc goes into more detail, and I am curious what you
>>>>> think about some of the concepts
>>>>> 
>>>>> I will follow up shortly with some interfaces - would you prefer
>>>>> references to a repo, inlining them in an email, or adding them as
>>>>> comments to your doc?
>>>>> 
>>>>> 
>>>>>> On Jun 17, 2020, at 4:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>> 
>>>>>> hi Radu,
>>>>>> 
>>>>>> I'll read the proposals in more detail when I can and make comments,
>>>>>> but this has always been something of interest (see, e.g. [1]). The
>>>>>> intent with the "C++ data frames" project that we've discussed (and I
>>>>>> continue to labor towards, e.g. recent compute engine work is directly
>>>>>> in service of this) has always been to be able to express computations
>>>>>> on non-RAM-resident datasets [2]
>>>>>> 
>>>>>> As one initial high level point of discussion, I think what you're
>>>>>> describing in these documents should probably be _new_ C++ classes and
>>>>>> _new_ virtual interfaces, not an evolution of the current arrow::Table
>>>>>> or arrow::Array/ChunkedArray classes. One practical path forward in
>>>>>> terms of discussing implementation issues would be to draft header
>>>>>> files proposing what these new class interfaces look like.
>>>>>> 
>>>>>> - Wes
>>>>>> 
>>>>>> [1]: https://issues.apache.org/jira/browse/ARROW-1329
>>>>>> [2]: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>>>>>> 
>>>>>> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
>>>>>> <radukay...@yahoo.com.invalid> wrote:
>>>>>>> 
>>>>>>> Hi folks,
>>>>>>> While I’ve been communicating with some members of this group in the
>>>>>>> past, this is my first official post, so please excuse/correct/guide me
>>>>>>> as needed.
>>>>>>> 
>>>>>>> Logistics first:
>>>>>>> I put most of the content of my proposals in Google Docs, but if more
>>>>>>> appropriate, we can keep the conversation going by email.
>>>>>>> Also, the two proposals are pretty independent, so if needed we can
>>>>>>> break them into two separate email threads, but for now I wanted to keep
>>>>>>> the spam low.
>>>>>>> 
>>>>>>> Actual proposals:
>>>>>>> Virtual Array - The idea is to be able to handle arrow Tables where
>>>>>>> some of the column data is not (yet) available in memory. For example, a
>>>>>>> Table can map to a Parquet file, create VirtualArrays for each column
>>>>>>> chunk, and only read the actual content if and when the Array is touched.
>>>>>>> Virtualize arrow Table
>>>>>>> <https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing>
>>>>>>> Random Access - I find that “application state” for most large scale
>>>>>>> systems is compatible with low level vectorized arrow representation,
>>>>>>> and I propose a number of API expansions that would enable thread safe
>>>>>>> data mutation and efficient random access.
>>>>>>> Arrow arrays random access
>>>>>>> <https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing>
>>>>>>> Please let me know what you think and what is the best course of action
>>>>>>> moving forward.
>>>>>>> Thank you
>>>>>>> Radu
>>>>> 
>>>>> 
>>> 
>
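
Illustrative sketch (not from the thread or from the linked documents): the
"Virtual Array" idea in the original post, where a column chunk is read only
if and when it is touched, can be pictured roughly as below. This is a minimal
plain-Python sketch under stated assumptions. The class VirtualArray, the
helper make_loader, and the in-memory "file" are all hypothetical names made
up for illustration; they are not Arrow APIs and not the interfaces proposed
in the pull request. A real implementation would also have to address thread
safety, which this sketch ignores.

```python
from typing import Callable, Dict, List, Optional


class VirtualArray:
    """Hypothetical column chunk whose values are read only on first access."""

    def __init__(self, length: int, loader: Callable[[], List[int]]) -> None:
        self._length = length      # assumed to be known from metadata alone
        self._loader = loader      # deferred read of the actual values
        self._values: Optional[List[int]] = None

    def __len__(self) -> int:
        # Answering the length does not trigger a read.
        return self._length

    @property
    def is_loaded(self) -> bool:
        return self._values is not None

    def values(self) -> List[int]:
        # First touch runs the loader; later calls reuse the cached result.
        if self._values is None:
            loaded = self._loader()
            if len(loaded) != self._length:
                raise ValueError("loader returned an unexpected number of values")
            self._values = loaded
        return self._values

    def __getitem__(self, index: int) -> int:
        return self.values()[index]


reads: List[str] = []  # records which columns were actually read


def make_loader(name: str, data: List[int]) -> Callable[[], List[int]]:
    """Hypothetical stand-in for reading one column chunk from a file."""
    def load() -> List[int]:
        reads.append(name)
        return list(data)
    return load


# A "table" with two virtual columns; nothing has been read yet.
table: Dict[str, VirtualArray] = {
    "a": VirtualArray(3, make_loader("a", [1, 2, 3])),
    "b": VirtualArray(3, make_loader("b", [4, 5, 6])),
}

# Only column "a" is touched, so only its loader runs.
total = sum(table["a"].values())
first = table["a"][0]
```

In this sketch the length comes from metadata, so operations such as column
selection or row counting can proceed without materializing any data, which is
the property the proposal is after.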
