Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Wes McKinney
On Mon, May 13, 2019 at 10:28 AM John Muehlhausen wrote: > > ``perhaps the right way forward is to start by gathering a > number of interested parties and start designing a proposal'' > > YES! How do we go about this? > I'd recommend writing a proposal document (using Google Docs or whatever

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
``perhaps the right way forward is to start by gathering a number of interested parties and start designing a proposal'' YES! How do we go about this? ``There are some early experiments to populate Arrow nodes in microbatches from Kafka'' (cf link in thread) Who did this? -John On Mon, May

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Antoine Pitrou
Hi John, We are strongly committed to backwards compatibility in the Arrow format specification. You should not fear any compatibility-breaking changes in the future. People sometimes express uncertainty because we have not reached 1.0 yet, but that's because we have not yet implemented all

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
Micah, yes, it all works at the moment. How have we staked out that it will always work in the future as people continue to work on the spec? That is my concern. Also, it would be extremely useful if someone opening a file had my nil rows hidden from them without needing to analyze the

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Micah Kornfield
This is already implicit in the spec because it requires 8 byte alignment and padding but recommends 64. I'd be ok updating the spec to explicitly state buffers might be oversized, but I agree with Wes that I don't think a format change is warranted. On Mon, May 13, 2019 at 6:29 AM John

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Wes McKinney
Furthermore, we already have a "custom_metadata" field on Message where you could indicate that a RecordBatch is underfilled; there's no need to change the protocol https://github.com/apache/arrow/blob/master/format/Message.fbs#L98 On Mon, May 13, 2019 at 8:30 AM Micah Kornfield wrote: > > Hi

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Micah Kornfield
Hi John, To expand on this I don't think there is anything preventing you in the current spec from over provisioning the underlying buffers. So you can effectively split "capacity" from "length" by subtracting the size of the buffer from the amount of space taken by the rows indicated in the

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
Thanks Wes, do you have any comment on the following from the zdnet story I linked? ``But the missing piece is streaming, where the velocity of incoming data poses a special challenge. There are some early experiments to populate Arrow nodes in microbatches from Kafka. And, as the edge gets

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Wes McKinney
hi John, Sorry, there's a number of fairly long e-mails in this thread; I'm having a hard time following all of the details. I suspect the most parsimonious thing would be to have some "sidecar" metadata that tracks the state of your writes into pre-allocated Arrow blocks so that readers know to

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow std::vector terminology) Thanks, John On Thu, May 9, 2019 at 2:46 PM John Muehlhausen wrote: > Wes et al, I think my core proposal is that Message.fbs:RecordBatch split > the "length" parameter into "theoretical max

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-09 Thread John Muehlhausen
Wes et al, I think my core proposal is that Message.fbs:RecordBatch split the "length" parameter into "theoretical max length" and "utilized length" (perhaps not those exact names). "theoretical max length" is the same as "length" now ... /// ...The arrays in the batch should all have this

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread Wes McKinney
On Tue, May 7, 2019 at 12:26 PM John Muehlhausen wrote: > > Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already reads > the future Feather format? If not, how will the future format differ? I > will work on my access pattern with this format instead of the current > feather

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread John Muehlhausen
Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already reads the future Feather format? If not, how will the future format differ? I will work on my access pattern with this format instead of the current feather format. Sorry I was not clear on that earlier. Micah, thank you! On

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread Micah Kornfield
Hi John, To give a specific pointer [1] describes how the streaming protocol is stored to a file. [1] https://arrow.apache.org/docs/format/IPC.html#file-format On Tue, May 7, 2019 at 9:40 AM Wes McKinney wrote: > hi John, > > As soon as the R folks can install the Arrow R package consistently,

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread John Muehlhausen
Thanks Wes: "the current Feather format is deprecated" ... yes, but there will be a future file format that replaces it, correct? And my discussion of immutable "portions" of Arrow buffers, rather than immutability of the entire buffer, applies to IPC as well, right? I am only championing the

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread Wes McKinney
hi John, On Tue, May 7, 2019 at 10:53 AM John Muehlhausen wrote: > > Wes et al, I completed a preliminary study of populating a Feather file > incrementally. Some notes and questions: > > I wrote the following dataframe to a feather file: > ab > 0 0123456789 0.0 > 1

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread John Muehlhausen
Wes et al, I completed a preliminary study of populating a Feather file incrementally. Some notes and questions: I wrote the following dataframe to a feather file:

            a    b
0  0123456789  0.0
1  0123456789  NaN
2  0123456789  NaN
3  0123456789  NaN
4        None  NaN

In re-writing the

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread Wes McKinney
hi John -- again, I would caution you against using Feather files for issues of longevity -- the internal memory layout of those files is a "dead man walking" so to speak. I would advise against forking the project, IMHO that is a dark path that leads nowhere good. We have a large community here

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread John Muehlhausen
François, Wes, Thanks for the feedback. I think the most practical thing for me to do is
1- write a Feather file that is structured to pre-allocate the space I need (e.g. initial variable-length strings are of average size)
2- come up with code to monkey around with the values contained in the

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread Francois Saint-Jacques
Hello John, Arrow is not yet suited for partial writes. The specification only talks about fully frozen/immutable objects, so you're in implementation-defined territory here. For example, the C++ library assumes the Array object is immutable; it memoizes the null count, and likely more statistics in

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread Wes McKinney
hi John, Feel free to open some JIRA issues to make a specific proposal about what you want to see in the libraries I would recommend not coupling yourself to the Feather format as it stands now, as I would like to change it as soon as > 90% of R users can successfully install the Arrow

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread John Muehlhausen
Wes, I’m not afraid of writing my own C++ code to deal with all of this on the writer side. I just need a way to “append” (incrementally populate) e.g. feather files so that a person using e.g. pyarrow doesn’t suffer some catastrophic failure... and “on the side” I tell them which rows are junk

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread John Muehlhausen
Thanks Jacques, Not what I had hoped, but assuming that I have some other mechanism for telling the reader which rows are junk, it seems like there is a follow-up question regarding adherence to specification for variable-width strings: Suppose I have 100 bytes for string storage and a vector of

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread Wes McKinney
hi John, In C++ the builder classes don't yet support writing into preallocated memory. It would be tricky for applications to determine a priori which segments of memory to pass to the builder. It seems only feasible for primitive / fixed-size types so my guess would be that a separate set of

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread Jacques Nadeau
This is more of a question of implementation versus specification. An arrow buffer is generally built and then sealed. In different languages, this building process works differently (a concern of the language rather than the memory specification). We don't currently allow a half built vector to

Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread John Muehlhausen
Hello, Glad to learn of this project— good work! If I allocate a single chunk of memory and start building Arrow format within it, does this chunk save any state regarding my progress? For example, suppose I allocate a column for floating point (fixed width) and a column for string (variable