[DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-20 Thread Wes McKinney
hi folks,

I'm interested in starting to build a so-called "data frame" interface
as a moderately opinionated, higher-level usability layer for
interacting with Arrow-based chunked in-memory data. I've had numerous
discussions (mostly in-person) over the last few years about this and
it feels to me that if we don't build something like this in Apache
Arrow, we could end up with several third-party efforts without much
community discussion or collaboration, which would be sad.

Another anti-pattern that is occurring is that users are loading data
into Arrow, converting to a library like pandas in order to do some
simple in-memory data manipulations, then converting back to Arrow.
This is not the intended long-term mode of operation.

I wrote in significantly more detail (~7-8 pages) about the context
and motivation for this project:

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing

Note that this would be a parallel effort to go alongside the
previously-discussed "Query Engine" project, and the two things are
intended to work together. Since we are creating computational
kernels, this would also provide some immediacy in being able to
invoke kernels easily on large in-memory datasets without having to
wait for a more full-fledged query engine system to be developed.
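
To make that concrete, here is a toy sketch of what a chunk-at-a-time
kernel over chunked in-memory data could look like. The types below are
purely illustrative stand-ins (in the spirit of Arrow's ChunkedArray),
not the actual Arrow C++ API:

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Toy model of a chunked column: values split across several contiguous
// chunks, as with Arrow's chunked in-memory data. Illustrative only.
using Chunk = std::vector<int64_t>;
using ChunkedColumn = std::vector<Chunk>;

// A "kernel" that operates one chunk at a time, so no concatenation of
// chunks (and no round trip through another library) is needed.
int64_t SumKernel(const ChunkedColumn& column) {
  int64_t total = 0;
  for (const Chunk& chunk : column) {
    total = std::accumulate(chunk.begin(), chunk.end(), total);
  }
  return total;
}
```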

The details of these kinds of projects can be bedeviling, so my
approach would be to begin to lay down the core abstractions and basic
APIs and use the project to drive the agenda for kernel development
(which can also be used in the context of a query engine runtime).
From my past experience designing pandas and some other in-memory
analytics projects, I have some idea of the kinds of mistakes or
design patterns I would like to _avoid_ in this effort, but others may
have some experiences they can offer to inform the design approach as
well.

Looking forward to comments and discussion.

- Wes


RE: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-08-12 Thread Eric Erhardt
Hey Wes,

I just wanted to check in on this work. Have there been any updates to the 
Arrow "data frame" project worth sharing?

Thanks,
Eric


Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-08-12 Thread Wes McKinney
hi Eric -- there have not been any patches yet related to it. I'm
currently in the midst of some internal restructuring of the Parquet
C++ library to address long-standing efficiency and memory use issues.
It's my intention to spend time on the data frame project as one of my
next focus areas, likely to be after Labor Day.

- Wes



Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-20 Thread Micah Kornfield
Hi Wes,
It looks like comments are turned off on the doc; is this intentional?

Thanks,
Micah



Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Antoine Pitrou


Hi Wes,

How does copy-on-write play together with memory-mapped data?  It seems
that, depending on whether the memory map has several concurrent users
(a condition which may be timing-dependent), we will either persist
changes on disk or make them ephemeral in memory.  That doesn't sound
very user-friendly, IMHO.

Regards

Antoine.




Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Wes McKinney
Comments are on now, sorry about that.

On Tue, May 21, 2019, 1:06 AM Micah Kornfield  wrote:

> Hi Wes,
> It looks like comments are turned off on the doc, this intentional?
>
> Thanks,
> Micah


Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Wes McKinney
hi Antoine,

On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou  wrote:
>
>
> Hi Wes,
>
> How does copy-on-write play together with memory-mapped data?  It seems
> that, depending on whether the memory map has several concurrent users
> (a condition which may be timing-dependent), we will either persist
> changes on disk or make them ephemeral in memory.  That doesn't sound
> very user-friendly, IMHO.

With memory-mapping, any Buffer is sliced from the parent MemoryMap
[1] so mutating the data on disk using this interface wouldn't be
possible with the way that I've framed it.

Note that memory-mapping at all already puts users well ahead of
everyday practice. You won't find examples of memory-mapping with
pandas in my book, for example, because it isn't possible. So
memory-mapping a dataset, performing some analytics on the mapped data
(causing results to be materialized in memory), then writing the
results out to a new file (or set of files) would be an innovation for
most users.

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L353
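
For readers unfamiliar with the underlying technique, here is a minimal
sketch of reading a file through a read-only memory map using plain
POSIX calls (a generic illustration of the concept, not Arrow's
MemoryMappedFile API):

```cpp
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read a file's contents through a read-only memory map. The returned
// string copies out of the mapping before it is unmapped.
std::string ReadViaMmap(const std::string& path) {
  int fd = open(path.c_str(), O_RDONLY);
  if (fd < 0) return "";
  struct stat st;
  fstat(fd, &st);
  size_t size = static_cast<size_t>(st.st_size);
  // PROT_READ + MAP_PRIVATE: the pages are read-only for this process,
  // so operations on the mapped data cannot write back to the file.
  void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);
  if (addr == MAP_FAILED) return "";
  std::string result(static_cast<const char*>(addr), size);
  munmap(addr, size);
  return result;
}
```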



Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Antoine Pitrou


On 21/05/2019 at 13:42, Wes McKinney wrote:
> hi Antoine,
> 
> On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou  wrote:
>>
>>
>> Hi Wes,
>>
>> How does copy-on-write play together with memory-mapped data?  It seems
>> that, depending on whether the memory map has several concurrent users
>> (a condition which may be timing-dependent), we will either persist
>> changes on disk or make them ephemeral in memory.  That doesn't sound
>> very user-friendly, IMHO.
> 
> With memory-mapping, any Buffer is sliced from the parent MemoryMap
> [1] so mutating the data on disk using this interface wouldn't be
> possible with the way that I've framed it.

Hmm... I always forget that SliceBuffer returns a read-only view.

Regards

Antoine.


Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-21 Thread Wes McKinney

The more important issue is that parent_ is non-null. The idea is that no
mutation is allowed if we reason that another Buffer object has access to
the address space of interest. I think this style of copy-on-write is a
reasonable compromise that prevents most kinds of defensive copying.
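
A toy sketch of that copy-on-write rule, with illustrative names only
(not Arrow's actual Buffer internals): a buffer is written in place
only when it has no parent and nothing else shares its storage;
otherwise it detaches by copying first, so no other Buffer object can
observe the mutation.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Toy model of a buffer whose slices share the parent's storage.
class ToyBuffer {
 public:
  // Root buffer owning its storage.
  explicit ToyBuffer(std::vector<uint8_t> data)
      : storage_(std::make_shared<std::vector<uint8_t>>(std::move(data))),
        offset_(0),
        length_(storage_->size()) {}

  // Zero-copy slice: shares the parent's storage and keeps the parent alive.
  ToyBuffer(std::shared_ptr<ToyBuffer> parent, size_t offset, size_t length)
      : storage_(parent->storage_),
        offset_(parent->offset_ + offset),
        length_(length),
        parent_(std::move(parent)) {}

  // Copy-on-write: if any other buffer could observe this memory (a parent
  // exists, or the storage is shared), copy this buffer's bytes first.
  void WriteByte(size_t i, uint8_t value) {
    if (parent_ != nullptr || storage_.use_count() > 1) {
      storage_ = std::make_shared<std::vector<uint8_t>>(
          storage_->begin() + offset_, storage_->begin() + offset_ + length_);
      offset_ = 0;
      parent_.reset();
    }
    (*storage_)[offset_ + i] = value;
  }

  uint8_t At(size_t i) const { return (*storage_)[offset_ + i]; }

  bool SharesStorageWith(const ToyBuffer& other) const {
    return storage_ == other.storage_;
  }

 private:
  std::shared_ptr<std::vector<uint8_t>> storage_;
  size_t offset_;
  size_t length_;
  std::shared_ptr<ToyBuffer> parent_;
};
```

Here a write through a slice silently detaches the slice from its
parent rather than mutating shared memory, which is what prevents the
defensive copying mentioned above.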


> Regards
>
> Antoine.
>