[DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
hi folks,

I'm interested in starting to build a so-called "data frame" interface as a moderately opinionated, higher-level usability layer for interacting with Arrow-based chunked in-memory data. I've had numerous discussions (mostly in-person) over the last few years about this, and it feels to me that if we don't build something like this in Apache Arrow, we could end up with several third-party efforts without much community discussion or collaboration, which would be sad.

Another anti-pattern that is occurring is that users are loading data into Arrow, converting to a library like pandas in order to do some simple in-memory data manipulations, then converting back to Arrow. This is not the intended long-term mode of operation.

I wrote in significantly more detail (~7-8 pages) about the context and motivation for this project:

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing

Note that this would be a parallel effort alongside the previously-discussed "Query Engine" project, and the two are intended to work together. Since we are creating computational kernels, this would also provide some immediacy in being able to invoke kernels easily on large in-memory datasets, without having to wait for a more full-fledged query engine system to be developed.

The details with these kinds of projects can be bedeviling, so my approach would be to begin by laying down the core abstractions and basic APIs, and to use the project to drive the agenda for kernel development (which can also be used in the context of a query engine runtime).

From my past experience designing pandas and some other in-memory analytics projects, I have some idea of the kinds of mistakes or design patterns I would like to _avoid_ in this effort, but others may have experiences they can offer to inform the design approach as well.

Looking forward to comments and discussion.

- Wes
RE: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
Hey Wes,

I just wanted to check in on this work. Have there been any updates to the Arrow "data frame" project worth sharing?

Thanks,
Eric

-----Original Message-----
From: Wes McKinney
Sent: Tuesday, May 21, 2019 8:17 AM
To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

[quoted thread trimmed]
Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
hi Eric -- there have not been any patches yet related to it. I'm currently in the midst of some internal restructuring of the Parquet C++ library to address long-standing efficiency and memory use issues. It's my intention to make the data frame project one of my next focus areas, likely after Labor Day.

- Wes

On Mon, Aug 12, 2019 at 10:28 AM Eric Erhardt wrote:
>
> Hey Wes,
>
> I just wanted to check in on this work. Have there been any updates to the
> Arrow "data frame" project worth sharing?
>
> Thanks,
> Eric
>
> [quoted thread trimmed]
Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
Hi Wes,

It looks like comments are turned off on the doc. Is this intentional?

Thanks,
Micah

On Mon, May 20, 2019 at 3:49 PM Wes McKinney wrote:
> hi folks,
>
> I'm interested in starting to build a so-called "data frame" interface
> as a moderately opinionated, higher-level usability layer for
> interacting with Arrow-based chunked in-memory data.
>
> [quoted text trimmed]
Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
Hi Wes,

How does copy-on-write play together with memory-mapped data? It seems that, depending on whether the memory map has several concurrent users (a condition which may be timing-dependent), we will either persist changes on disk or make them ephemeral in memory. That doesn't sound very user-friendly, IMHO.

Regards

Antoine.

Le 21/05/2019 à 00:39, Wes McKinney a écrit :
> hi folks,
>
> I'm interested in starting to build a so-called "data frame" interface
> as a moderately opinionated, higher-level usability layer for
> interacting with Arrow-based chunked in-memory data.
>
> [quoted text trimmed]
Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
Comments are on now, sorry about that.

On Tue, May 21, 2019, 1:06 AM Micah Kornfield wrote:
> Hi Wes,
> It looks like comments are turned off on the doc. Is this intentional?
>
> Thanks,
> Micah
>
> [quoted text trimmed]
Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
hi Antoine,

On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou wrote:
>
> Hi Wes,
>
> How does copy-on-write play together with memory-mapped data? It seems
> that, depending on whether the memory map has several concurrent users
> (a condition which may be timing-dependent), we will either persist
> changes on disk or make them ephemeral in memory. That doesn't sound
> very user-friendly, IMHO.

With memory-mapping, any Buffer is sliced from the parent MemoryMap [1], so mutating the data on disk through this interface wouldn't be possible with the way that I've framed it.

Note that memory-mapping at all is already a significant advance over what most people are using every day. You won't find examples of memory-mapping with pandas in my book, for example, because it isn't possible. So if you memory-map, perform some analytics on the mapped data (causing results to be materialized in memory), then write out the results to a new file (or set of files), that would be an innovation for most users.

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L353

> [quoted text trimmed]
Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
Le 21/05/2019 à 13:42, Wes McKinney a écrit :
> With memory-mapping, any Buffer is sliced from the parent MemoryMap
> [1] so mutating the data on disk using this interface wouldn't be
> possible with the way that I've framed it.

Hmm... I always forget that SliceBuffer returns a read-only view.

Regards

Antoine.
Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
On Tue, May 21, 2019, 8:43 AM Antoine Pitrou wrote:
>
> Hmm... I always forget that SliceBuffer returns a read-only view.

The more important issue is that parent_ is non-null. The idea is that no mutation is allowed if we reason that another Buffer object has access to the address space of interest. I think this style of copy-on-write is a reasonable compromise that prevents most kinds of defensive copying.

> Regards
>
> Antoine.