Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-08-05 Thread Radu Teodorescu


> I will have a closer look and comment most likely next week.

Thank you!

> 
> Unfortunately, having code developed in external repositories increases the
> complexity of importing that code back into the Apache project. Not sure if
> you’re interested in preemptively following the project’s style guide (file
> naming, C++ code style, etc.), but that would also help.

I understand that challenge; my intent was to prove to myself, and anyone else, 
that there is a satisfying implementation that provides the semantics and 
performance levels I am referring to in my proposals. It is a reference 
implementation, but certainly not something that can be dropped in directly in 
its current form (for example, I am leaning quite heavily on C++14/17 and a bit 
of 20), but if the vision makes sense I would love to bring that into Arrow.

> On Wed, Aug 5, 2020 at 7:43 AM Radu Teodorescu 
> wrote:
> 
>> Wes & crew,
>> Congratulations and thank you for the successful 1.0 rollout; it is
>> certainly making a huge difference for my day job!
>> Is it a good time now to revive the conversation below? (and
>> https://github.com/apache/arrow/pull/7548 )
>> I have also gone ahead and released a prototype that covers some of the
>> more hand-wavy parts of my interface proposal (i.e. ways to compose arrays
>> in a dataframe that control the balance between fragmentation and buffer
>> copying) - it is here: https://github.com/raduteo/framespaces/tree/master
>> It lacks documentation, but the basic data structures are robustly
>> implemented and tested, so if we find merit in the original PR
>> (https://github.com/apache/arrow/pull/7548), there should be a reasonable
>> path for implementing most of it.
>> 
>> Thank you
>> Radu
>> 
>> 
>>> On Jun 25, 2020, at 3:10 PM, Radu Teodorescu
>>  wrote:
>>> 
>>> Understood and agreed
>>> My proposal really addresses a number of mechanisms on layer 2 (
>> "Virtual" tables) in your taxonomy (I can adjust interface names
>> accordingly as part of the review process).
>>> One additional element I am proposing here is the ability to insert and
>> modify rows in a vectorized fashion - they follow the same mechanics as
>> “filter” (which is effectively row removal)
>>> and I think they are quite important as an efficiently supported
>> construct (for things like data cleanup, data set updates etc.)
>>> 
>>> I’m really looking forward to hearing more of your thoughts (as well as
>> anybody else’s who is interested in this topic)
>>> Radu
>>> 
>>> 
 On Jun 25, 2020, at 2:52 PM, Wes McKinney  wrote:
 
 hi Radu,
 
 It's going to be challenging for me to review in detail until after
 the 1.0.0 release is out, but in general I think there are 3 layers
 that we need to be talking about:
 
 * Materialized in-memory tables
 * "Virtual" tables, whose in-memory/not-in-memory semantics are not
 exposed -- permitting column selection, iteration as for execution of
 query engine operators (e.g. projection, filter, join, aggregate), and
 random access
 * "Data Frame API": a programming interface for expressing analytical
 operations on virtual tables. A data frame could be exported to
 materialized tables / record batches e.g. for writing to Parquet or
 IPC streams
 
 In principle the "Data Frame API" shouldn't need to know much about
 the first two layers, instead working with high level primitives and
 leaving the execution of those primitives to the layers below. Does
 this make sense?
 
 I think we should be pretty strict about separation of concerns
 between these three layers. I'll dig in in more detail sometime after
 July 4.
 
 Thanks
 Wes
 
 
 
 
 On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
  wrote:
> 
> Here it is as a pull request:
> https://github.com/apache/arrow/pull/7548
> 
> I hope this can be a starter for an active conversation diving into
>> specifics, and I look forward to contributing more design and algorithm
>> ideas as well as concrete code.
> 
>> On Jun 17, 2020, at 6:11 PM, Neal Richardson <
>> neal.p.richard...@gmail.com> wrote:
>> 
>> Maybe a draft pull request? If you put "WIP" in the pull request
>> title, CI
>> won't run builds on it, so it's suitable for rough outlines and
>> collecting
>> feedback.
>> 
>> Neal
>> 
>> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>>  wrote:
>> 
>>> Thank you Wes!
>>> Yes, both proposals fit very nicely in your Data Frames vision, I
>> see them
>>> as deep dives on some specifics:
>>> - the virtual array doc is more fluffy and probably, if you agree with
>> the
>>> general concept, the next logical move is to put out some interfaces
>> indeed
>>> - 

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-08-05 Thread Wes McKinney
I will have a closer look and comment most likely next week.

Unfortunately, having code developed in external repositories increases the
complexity of importing that code back into the Apache project. Not sure if
you’re interested in preemptively following the project’s style guide (file
naming, C++ code style, etc.), but that would also help.

On Wed, Aug 5, 2020 at 7:43 AM Radu Teodorescu 
wrote:

> Wes & crew,
> Congratulations and thank you for the successful 1.0 rollout; it is
> certainly making a huge difference for my day job!
> Is it a good time now to revive the conversation below? (and
> https://github.com/apache/arrow/pull/7548 )
> I have also gone ahead and released a prototype that covers some of the
> more hand-wavy parts of my interface proposal (i.e. ways to compose arrays
> in a dataframe that control the balance between fragmentation and buffer
> copying) - it is here: https://github.com/raduteo/framespaces/tree/master
> It lacks documentation, but the basic data structures are robustly
> implemented and tested, so if we find merit in the original PR
> (https://github.com/apache/arrow/pull/7548), there should be a reasonable
> path for implementing most of it.
>
> Thank you
> Radu
>
>
> > On Jun 25, 2020, at 3:10 PM, Radu Teodorescu
>  wrote:
> >
> > Understood and agreed
> > My proposal really addresses a number of mechanisms on layer 2 (
> "Virtual" tables) in your taxonomy (I can adjust interface names
> accordingly as part of the review process).
> > One additional element I am proposing here is the ability to insert and
> modify rows in a vectorized fashion - they follow the same mechanics as
> “filter” (which is effectively row removal)
> > and I think they are quite important as an efficiently supported
> construct (for things like data cleanup, data set updates etc.)
> >
> > I’m really looking forward to hearing more of your thoughts (as well as
> anybody else’s who is interested in this topic)
> > Radu
> >
> >
> >> On Jun 25, 2020, at 2:52 PM, Wes McKinney  wrote:
> >>
> >> hi Radu,
> >>
> >> It's going to be challenging for me to review in detail until after
> >> the 1.0.0 release is out, but in general I think there are 3 layers
> >> that we need to be talking about:
> >>
> >> * Materialized in-memory tables
> >> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
> >> exposed -- permitting column selection, iteration as for execution of
> >> query engine operators (e.g. projection, filter, join, aggregate), and
> >> random access
> >> * "Data Frame API": a programming interface for expressing analytical
> >> operations on virtual tables. A data frame could be exported to
> >> materialized tables / record batches e.g. for writing to Parquet or
> >> IPC streams
> >>
> >> In principle the "Data Frame API" shouldn't need to know much about
> >> the first two layers, instead working with high level primitives and
> >> leaving the execution of those primitives to the layers below. Does
> >> this make sense?
> >>
> >> I think we should be pretty strict about separation of concerns
> >> between these three layers. I'll dig in in more detail sometime after
> >> July 4.
> >>
> >> Thanks
> >> Wes
> >>
> >>
> >>
> >>
> >> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
> >>  wrote:
> >>>
> >>> Here it is as a pull request:
> >>> https://github.com/apache/arrow/pull/7548
> >>>
> >>> I hope this can be a starter for an active conversation diving into
> specifics, and I look forward to contributing more design and algorithm
> ideas as well as concrete code.
> >>>
>  On Jun 17, 2020, at 6:11 PM, Neal Richardson <
> neal.p.richard...@gmail.com> wrote:
> 
>  Maybe a draft pull request? If you put "WIP" in the pull request
> title, CI
>  won't run builds on it, so it's suitable for rough outlines and
> collecting
>  feedback.
> 
>  Neal
> 
>  On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>   wrote:
> 
> > Thank you Wes!
> > Yes, both proposals fit very nicely in your Data Frames vision, I
> see them
> > as deep dives on some specifics:
> > - the virtual array doc is more fluffy and probably, if you agree with
> the
> > general concept, the next logical move is to put out some interfaces
> indeed
> > - the random access doc goes into more details and I am curious what
> you
> > think about some of the concepts
> >
> > I will follow up shortly with some interfaces - do you prefer
> references
> > to a repo, inline them in an email or add them as comments to your
> doc?
> >
> >
> >> On Jun 17, 2020, at 4:26 PM, Wes McKinney 
> wrote:
> >>
> >> hi Radu,
> >>
> >> I'll read the proposals in more detail when I can and make comments,
> >> but this has always been something of interest (see, e.g. [1]). The
> >> 

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-08-05 Thread Radu Teodorescu
Wes & crew,
Congratulations and thank you for the successful 1.0 rollout; it is certainly 
making a huge difference for my day job!
Is it a good time now to revive the conversation below? (and 
https://github.com/apache/arrow/pull/7548 ) 
I have also gone ahead and released a prototype that covers some of the more 
hand-wavy parts of my interface proposal (i.e. ways to compose arrays in a 
dataframe that control the balance between fragmentation and buffer copying) 
- it is here: https://github.com/raduteo/framespaces/tree/master
It lacks documentation, but the basic data structures are robustly implemented 
and tested, so if we find merit in the original PR 
(https://github.com/apache/arrow/pull/7548), there should be a reasonable path 
for implementing most of it.
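
To give a rough flavor of the composition idea (simplified, hypothetical types rather than the actual framespaces classes): a column can be a list of slices over shared buffers, with a fragment-count threshold deciding when to pay for a copy:

// Rough illustration only: a column composed of slices over existing buffers,
// with a knob that trades fragmentation (many small slices) against copying.
#include <cstdint>
#include <memory>
#include <vector>

struct Slice {
  std::shared_ptr<std::vector<double>> source;  // shared, immutable buffer
  int64_t offset;
  int64_t length;
};

class ComposedColumn {
 public:
  // Appending a slice is O(1): no data is copied, fragmentation grows.
  void Append(std::shared_ptr<std::vector<double>> src, int64_t off, int64_t len) {
    slices_.push_back({std::move(src), off, len});
    if (slices_.size() > kMaxFragments) Compact();  // bound fragmentation
  }

  // Compact copies everything into one buffer: more copying, fewer fragments.
  void Compact() {
    auto merged = std::make_shared<std::vector<double>>();
    for (const auto& s : slices_)
      merged->insert(merged->end(), s.source->begin() + s.offset,
                     s.source->begin() + s.offset + s.length);
    slices_ = {{merged, 0, static_cast<int64_t>(merged->size())}};
  }

 private:
  static constexpr size_t kMaxFragments = 64;  // the fragmentation/copy knob
  std::vector<Slice> slices_;
};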
 
Thank you
Radu
 

> On Jun 25, 2020, at 3:10 PM, Radu Teodorescu  
> wrote:
> 
> Understood and agreed
> My proposal really addresses a number of mechanisms on layer 2 ( "Virtual" 
> tables) in your taxonomy (I can adjust interface names accordingly as part of 
> the review process).
> One additional element I am proposing here is the ability to insert and 
> modify rows in a vectorized fashion - they follow the same mechanics as 
> “filter” (which is effectively row removal) 
> and I think they are quite important as an efficiently supported construct 
> (for things like data cleanup, data set updates etc.)
> 
> I’m really looking forward to hearing more of your thoughts (as well as anybody 
> else’s who is interested in this topic)
> Radu 
> 
> 
>> On Jun 25, 2020, at 2:52 PM, Wes McKinney  wrote:
>> 
>> hi Radu,
>> 
>> It's going to be challenging for me to review in detail until after
>> the 1.0.0 release is out, but in general I think there are 3 layers
>> that we need to be talking about:
>> 
>> * Materialized in-memory tables
>> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
>> exposed -- permitting column selection, iteration as for execution of
>> query engine operators (e.g. projection, filter, join, aggregate), and
>> random access
>> * "Data Frame API": a programming interface for expressing analytical
>> operations on virtual tables. A data frame could be exported to
>> materialized tables / record batches e.g. for writing to Parquet or
>> IPC streams
>> 
>> In principle the "Data Frame API" shouldn't need to know much about
>> the first two layers, instead working with high level primitives and
>> leaving the execution of those primitives to the layers below. Does
>> this make sense?
>> 
>> I think we should be pretty strict about separation of concerns
>> between these three layers. I'll dig in in more detail sometime after
>> July 4.
>> 
>> Thanks
>> Wes
>> 
>> 
>> 
>> 
>> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
>>  wrote:
>>> 
>>> Here it is as a pull request:
>>> https://github.com/apache/arrow/pull/7548 
>>> 
>>> 
>>> I hope this can be a starter for an active conversation diving into 
>>> specifics, and I look forward to contributing more design and algorithm 
>>> ideas as well as concrete code.
>>> 
 On Jun 17, 2020, at 6:11 PM, Neal Richardson  
 wrote:
 
 Maybe a draft pull request? If you put "WIP" in the pull request title, CI
 won't run builds on it, so it's suitable for rough outlines and collecting
 feedback.
 
 Neal
 
 On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
  wrote:
 
> Thank you Wes!
> Yes, both proposals fit very nicely in your Data Frames vision, I see them
> as deep dives on some specifics:
> - the virtual array doc is more fluffy and probably, if you agree with the
> general concept, the next logical move is to put out some interfaces 
> indeed
> - the random access doc goes into more details and I am curious what you
> think about some of the concepts
> 
> I will follow up shortly with some interfaces - do you prefer references
> to a repo, inline them in an email or add them as comments to your doc?
> 
> 
>> On Jun 17, 2020, at 4:26 PM, Wes McKinney  wrote:
>> 
>> hi Radu,
>> 
>> I'll read the proposals in more detail when I can and make comments,
>> but this has always been something of interest (see, e.g. [1]). The
>> intent with the "C++ data frames" project that we've discussed (and I
>> continue to labor towards, e.g. recent compute engine work is directly
>> in service of this) has always been to be able to express computations
>> on non-RAM-resident datasets [2]
>> 
>> As one initial high level point of discussion, I think what you're
>> describing in these documents should probably be _new_ C++ classes and
>> _new_ virtual interfaces, not an evolution of the current arrow::Table
>> or arrow::Array/ChunkedArray classes. One practical path forward in

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-06-25 Thread Radu Teodorescu
Understood and agreed
My proposal really addresses a number of mechanisms on layer 2 ( "Virtual" 
tables) in your taxonomy (I can adjust interface names accordingly as part of 
the review process).
One additional element I am proposing here is the ability to insert and modify 
rows in a vectorized fashion - they follow the same mechanics as “filter” 
(which is effectively row removal), 
and I think they are quite important as an efficiently supported construct (for 
things like data cleanup, data set updates etc.)
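
To sketch what I mean by "the same mechanics as filter" (plain C++ with hypothetical names, not the interface itself): insert and update can be driven by a position vector in a single vectorized pass, just as filtering is driven by a mask:

#include <cstdint>
#include <vector>

// Filter: keep rows where mask[i] is true (effectively vectorized removal).
std::vector<double> Filter(const std::vector<double>& values,
                           const std::vector<bool>& mask) {
  std::vector<double> out;
  out.reserve(values.size());
  for (size_t i = 0; i < values.size(); ++i)
    if (mask[i]) out.push_back(values[i]);
  return out;
}

// InsertAt: splice `rows` in front of the given (sorted) original positions
// in one pass over the input.
std::vector<double> InsertAt(const std::vector<double>& values,
                             const std::vector<int64_t>& positions,
                             const std::vector<double>& rows) {
  std::vector<double> out;
  out.reserve(values.size() + rows.size());
  size_t p = 0;
  for (int64_t i = 0; i <= static_cast<int64_t>(values.size()); ++i) {
    while (p < positions.size() && positions[p] == i) out.push_back(rows[p++]);
    if (i < static_cast<int64_t>(values.size())) out.push_back(values[i]);
  }
  return out;
}

// UpdateAt: vectorized modification at the given positions.
void UpdateAt(std::vector<double>& values,
              const std::vector<int64_t>& positions,
              const std::vector<double>& rows) {
  for (size_t k = 0; k < positions.size(); ++k) values[positions[k]] = rows[k];
}

The point is that inserts and updates, like filter, are expressed once per batch rather than once per row.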

I’m really looking forward to hearing more of your thoughts (as well as anybody 
else’s who is interested in this topic)
Radu 


> On Jun 25, 2020, at 2:52 PM, Wes McKinney  wrote:
> 
> hi Radu,
> 
> It's going to be challenging for me to review in detail until after
> the 1.0.0 release is out, but in general I think there are 3 layers
> that we need to be talking about:
> 
> * Materialized in-memory tables
> * "Virtual" tables, whose in-memory/not-in-memory semantics are not
> exposed -- permitting column selection, iteration as for execution of
> query engine operators (e.g. projection, filter, join, aggregate), and
> random access
> * "Data Frame API": a programming interface for expressing analytical
> operations on virtual tables. A data frame could be exported to
> materialized tables / record batches e.g. for writing to Parquet or
> IPC streams
> 
> In principle the "Data Frame API" shouldn't need to know much about
> the first two layers, instead working with high level primitives and
> leaving the execution of those primitives to the layers below. Does
> this make sense?
> 
> I think we should be pretty strict about separation of concerns
> between these three layers. I'll dig in in more detail sometime after
> July 4.
> 
> Thanks
> Wes
> 
> 
> 
> 
> On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
>  wrote:
>> 
>> Here it is as a pull request:
>> https://github.com/apache/arrow/pull/7548 
>> 
>> 
>> I hope this can be a starter for an active conversation diving into 
>> specifics, and I look forward to contributing more design and algorithm 
>> ideas as well as concrete code.
>> 
>>> On Jun 17, 2020, at 6:11 PM, Neal Richardson  
>>> wrote:
>>> 
>>> Maybe a draft pull request? If you put "WIP" in the pull request title, CI
>>> won't run builds on it, so it's suitable for rough outlines and collecting
>>> feedback.
>>> 
>>> Neal
>>> 
>>> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>>>  wrote:
>>> 
 Thank you Wes!
 Yes, both proposals fit very nicely in your Data Frames vision, I see them
 as deep dives on some specifics:
 - the virtual array doc is more fluffy and probably, if you agree with the
 general concept, the next logical move is to put out some interfaces indeed
 - the random access doc goes into more details and I am curious what you
 think about some of the concepts
 
 I will follow up shortly with some interfaces - do you prefer references
 to a repo, inline them in an email or add them as comments to your doc?
 
 
> On Jun 17, 2020, at 4:26 PM, Wes McKinney  wrote:
> 
> hi Radu,
> 
> I'll read the proposals in more detail when I can and make comments,
> but this has always been something of interest (see, e.g. [1]). The
> intent with the "C++ data frames" project that we've discussed (and I
> continue to labor towards, e.g. recent compute engine work is directly
> in service of this) has always been to be able to express computations
> on non-RAM-resident datasets [2]
> 
> As one initial high level point of discussion, I think what you're
> describing in these documents should probably be _new_ C++ classes and
> _new_ virtual interfaces, not an evolution of the current arrow::Table
> or arrow::Array/ChunkedArray classes. One practical path forward in
> terms of discussing implementation issues would be to draft header
> files proposing what these new class interfaces look like.
> 
> - Wes
> 
> [1]: https://issues.apache.org/jira/browse/ARROW-1329
> [2]:
 https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
> 
> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
>  wrote:
>> 
>> Hi folks,
>> While I’ve been communicating with some members of this group in the
 past, this is my first official post so please excuse/correct/guide me as
 needed.
>> 
>> Logistics first:
>> I put most of the content of my proposals in google doc, but if more
 appropriate, we can keep the conversation going by email.
>> Also the two proposals are pretty independent, so if needed we can
 break it into two separate email threads, but for now I wanted to keep the
 spam low
>> 
>> Actual proposals:
>> Virtual Array - The idea is to be able to handle arrow Tables where
 some of the column da

Re: Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-06-25 Thread Wes McKinney
hi Radu,

It's going to be challenging for me to review in detail until after
the 1.0.0 release is out, but in general I think there are 3 layers
that we need to be talking about:

* Materialized in-memory tables
* "Virtual" tables, whose in-memory/not-in-memory semantics are not
exposed -- permitting column selection, iteration as for execution of
query engine operators (e.g. projection, filter, join, aggregate), and
random access
* "Data Frame API": a programming interface for expressing analytical
operations on virtual tables. A data frame could be exported to
materialized tables / record batches e.g. for writing to Parquet or
IPC streams

In principle the "Data Frame API" shouldn't need to know much about
the first two layers, instead working with high level primitives and
leaving the execution of those primitives to the layers below. Does
this make sense?
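
To make the layering concrete, here is a minimal sketch of the separation (the type names are purely illustrative, not existing or proposed Arrow classes, and real columns would be Arrow arrays rather than std::vector):

#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Layer 1 stand-in: a materialized column (placeholder for real Arrow buffers).
struct MaterializedColumn {
  std::string name;
  std::vector<double> values;
};

// Layer 2: a "virtual" table, offering column selection, iteration and random
// access, with no promise about what is or is not resident in memory.
class VirtualTable {
 public:
  virtual ~VirtualTable() = default;
  virtual int64_t num_rows() const = 0;
  virtual std::vector<std::string> column_names() const = 0;
  // Materialize a row range of one column on demand (from RAM, a file, etc.).
  virtual MaterializedColumn Fetch(const std::string& column,
                                   int64_t offset, int64_t length) const = 0;
};

// Layer 3: the data frame API records operations against the virtual interface
// and only forces materialization on export (e.g. before writing Parquet/IPC).
class DataFrame {
 public:
  explicit DataFrame(std::shared_ptr<const VirtualTable> table)
      : table_(std::move(table)), projection_(table_->column_names()) {}

  DataFrame Select(std::vector<std::string> columns) const {
    DataFrame out = *this;
    out.projection_ = std::move(columns);  // intent only, nothing is read yet
    return out;
  }

  std::vector<MaterializedColumn> ToMaterialized() const {
    std::vector<MaterializedColumn> out;
    for (const auto& name : projection_)
      out.push_back(table_->Fetch(name, 0, table_->num_rows()));
    return out;
  }

 private:
  std::shared_ptr<const VirtualTable> table_;
  std::vector<std::string> projection_;
};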

I think we should be pretty strict about separation of concerns
between these three layers. I'll dig in in more detail sometime after
July 4.

Thanks
Wes




On Thu, Jun 25, 2020 at 11:50 AM Radu Teodorescu
 wrote:
>
> Here it is as a pull request:
> https://github.com/apache/arrow/pull/7548 
> 
>
> I hope this can be a starter for an active conversation diving into 
> specifics, and I look forward to contributing more design and algorithm 
> ideas as well as concrete code.
>
> > On Jun 17, 2020, at 6:11 PM, Neal Richardson  
> > wrote:
> >
> > Maybe a draft pull request? If you put "WIP" in the pull request title, CI
> > won't run builds on it, so it's suitable for rough outlines and collecting
> > feedback.
> >
> > Neal
> >
> > On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
> >  wrote:
> >
> >> Thank you Wes!
> >> Yes, both proposals fit very nicely in your Data Frames vision, I see them
> >> as deep dives on some specifics:
> >> - the virtual array doc is more fluffy and probably, if you agree with the
> >> general concept, the next logical move is to put out some interfaces indeed
> >> - the random access doc goes into more details and I am curious what you
> >> think about some of the concepts
> >>
> >> I will follow up shortly with some interfaces - do you prefer references
> >> to a repo, inline them in an email or add them as comments to your doc?
> >>
> >>
> >>> On Jun 17, 2020, at 4:26 PM, Wes McKinney  wrote:
> >>>
> >>> hi Radu,
> >>>
> >>> I'll read the proposals in more detail when I can and make comments,
> >>> but this has always been something of interest (see, e.g. [1]). The
> >>> intent with the "C++ data frames" project that we've discussed (and I
> >>> continue to labor towards, e.g. recent compute engine work is directly
> >>> in service of this) has always been to be able to express computations
> >>> on non-RAM-resident datasets [2]
> >>>
> >>> As one initial high level point of discussion, I think what you're
> >>> describing in these documents should probably be _new_ C++ classes and
> >>> _new_ virtual interfaces, not an evolution of the current arrow::Table
> >>> or arrow::Array/ChunkedArray classes. One practical path forward in
> >>> terms of discussing implementation issues would be to draft header
> >>> files proposing what these new class interfaces look like.
> >>>
> >>> - Wes
> >>>
> >>> [1]: https://issues.apache.org/jira/browse/ARROW-1329
> >>> [2]:
> >> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
> >>>
> >>> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
> >>>  wrote:
> 
>  Hi folks,
>  While I’ve been communicating with some members of this group in the
> >> past, this is my first official post so please excuse/correct/guide me as
> >> needed.
> 
>  Logistics first:
>  I put most of the content of my proposals in google doc, but if more
> >> appropriate, we can keep the conversation going by email.
>  Also the two proposals are pretty independent, so if needed we can
> >> break it into two separate email threads, but for now I wanted to keep the
> >> spam low
> 
>  Actual proposals:
>  Virtual Array - The idea is to be able to handle arrow Tables where
> >> some of the column data is not (yet) available in memory. For example a
> >> Table can map to a parquet file, create VirtualArrays for each column chunk
> >> and only read the actual content if and when the Array is touched.
 Virtualize arrow Table:
> >> https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing
>  Random Access - I find that “application state” for most large scale
> >> systems is compatible with low level vectorized arrow representation and I
> >> propose a number of API expansions that would enable thread safe data
> >> mutation and efficient random access.
 Arrow arrays random access:
> >> https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing
>  Please let me know what you thi

Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-06-25 Thread Radu Teodorescu
Here it is as a pull request:
https://github.com/apache/arrow/pull/7548 


I hope this can be a starter for an active conversation diving into specifics, 
and I look forward to contributing more design and algorithm ideas as well 
as concrete code.

> On Jun 17, 2020, at 6:11 PM, Neal Richardson  
> wrote:
> 
> Maybe a draft pull request? If you put "WIP" in the pull request title, CI
> won't run builds on it, so it's suitable for rough outlines and collecting
> feedback.
> 
> Neal
> 
> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
>  wrote:
> 
>> Thank you Wes!
>> Yes, both proposals fit very nicely in your Data Frames vision, I see them
>> as deep dives on some specifics:
>> - the virtual array doc is more fluffy and probably, if you agree with the
>> general concept, the next logical move is to put out some interfaces indeed
>> - the random access doc goes into more details and I am curious what you
>> think about some of the concepts
>> 
>> I will follow up shortly with some interfaces - do you prefer references
>> to a repo, inline them in an email or add them as comments to your doc?
>> 
>> 
>>> On Jun 17, 2020, at 4:26 PM, Wes McKinney  wrote:
>>> 
>>> hi Radu,
>>> 
>>> I'll read the proposals in more detail when I can and make comments,
>>> but this has always been something of interest (see, e.g. [1]). The
>>> intent with the "C++ data frames" project that we've discussed (and I
>>> continue to labor towards, e.g. recent compute engine work is directly
>>> in service of this) has always been to be able to express computations
>>> on non-RAM-resident datasets [2]
>>> 
>>> As one initial high level point of discussion, I think what you're
>>> describing in these documents should probably be _new_ C++ classes and
>>> _new_ virtual interfaces, not an evolution of the current arrow::Table
>>> or arrow::Array/ChunkedArray classes. One practical path forward in
>>> terms of discussing implementation issues would be to draft header
>>> files proposing what these new class interfaces look like.
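
For illustration only, such a draft header might start from something like the following (type names are placeholders, not proposed Arrow classes):

#pragma once
#include <cstdint>
#include <memory>
#include <string>

namespace dataframe {

class ColumnSource;  // owns or locates the data, in memory or not

// New virtual interface, separate from arrow::Table / arrow::ChunkedArray.
class FrameColumn {
 public:
  virtual ~FrameColumn() = default;
  virtual int64_t length() const = 0;
  virtual bool is_materialized() const = 0;
  // Force the backing data into memory; may be a no-op if already resident.
  virtual std::shared_ptr<ColumnSource> Materialize() const = 0;
};

class Frame {
 public:
  virtual ~Frame() = default;
  virtual int64_t num_rows() const = 0;
  virtual std::shared_ptr<FrameColumn> column(const std::string& name) const = 0;
};

}  // namespace dataframe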
>>> 
>>> - Wes
>>> 
>>> [1]: https://issues.apache.org/jira/browse/ARROW-1329
>>> [2]:
>> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>>> 
>>> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
>>>  wrote:
 
 Hi folks,
 While I’ve been communicating with some members of this group in the
>> past, this is my first official post so please excuse/correct/guide me as
>> needed.
 
 Logistics first:
 I put most of the content of my proposals in google doc, but if more
>> appropriate, we can keep the conversation going by email.
 Also the two proposals are pretty independent, so if needed we can
>> break it into two separate email threads, but for now I wanted to keep the
>> spam low
 
 Actual proposals:
 Virtual Array - The idea is to be able to handle arrow Tables where
>> some of the column data is not (yet) available in memory. For example a
>> Table can map to a parquet file, create VirtualArrays for each column chunk
>> and only read the actual content if and when the Array is touched.
 Virtualize arrow Table:
>> https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing
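
As a rough illustration of the lazy-read idea above (hypothetical types, not the API from the linked doc), a virtual array can hold a loader callback, for example one that reads a single parquet column chunk, and only invoke it on first access:

#include <cstdint>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

class VirtualArray {
 public:
  // The loader stands in for "read one parquet column chunk".
  using Loader = std::function<std::vector<double>()>;

  VirtualArray(int64_t length, Loader loader)
      : length_(length), loader_(std::move(loader)) {}

  int64_t length() const { return length_; }

  // First access triggers the read; later accesses hit the cached data.
  const std::vector<double>& Touch() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!data_) data_ = loader_();
    return *data_;
  }

 private:
  int64_t length_;
  Loader loader_;
  std::optional<std::vector<double>> data_;
  std::mutex mutex_;
};

The same shape extends naturally to a Table whose columns are VirtualArrays backed by per-chunk readers.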
 Random Access - I find that “application state” for most large scale
>> systems is compatible with low level vectorized arrow representation and I
>> propose a number of API expansions that would enable thread safe data
>> mutation and efficient random access.
 Arrow arrays random access:
>> https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing
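
And to illustrate the random-access side (again hypothetical types, not the proposed API): with a prefix sum of chunk lengths, a global row index resolves to a chunk and local offset in O(log #chunks), one common way to get efficient element access over a chunked layout:

#include <algorithm>
#include <cstdint>
#include <vector>

class ChunkedColumn {
 public:
  void AddChunk(std::vector<double> chunk) {
    int64_t prev = offsets_.empty() ? 0 : offsets_.back();
    offsets_.push_back(prev + static_cast<int64_t>(chunk.size()));
    chunks_.push_back(std::move(chunk));
  }

  // row must be in [0, total length).
  double Value(int64_t row) const {
    // First cumulative offset strictly greater than row owns the row.
    auto it = std::upper_bound(offsets_.begin(), offsets_.end(), row);
    size_t chunk = static_cast<size_t>(it - offsets_.begin());
    int64_t start = chunk == 0 ? 0 : offsets_[chunk - 1];
    return chunks_[chunk][static_cast<size_t>(row - start)];
  }

 private:
  std::vector<std::vector<double>> chunks_;
  std::vector<int64_t> offsets_;  // cumulative chunk lengths
};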
 Please let me know what you think and what is the best course of action
>> moving forward.
 Thank you
 Radu
>> 
>>