On Mon, May 13, 2019 at 10:28 AM John Muehlhausen wrote:
>
> ``perhaps the right way forward is to start by gathering a
> number of interested parties and start designing a proposal''
>
> YES! How do we go about this?
>
I'd recommend writing a proposal document (using Google Docs or
whatever
``perhaps the right way forward is to start by gathering a
number of interested parties and start designing a proposal''
YES! How do we go about this?
``There are some early experiments to populate Arrow nodes in microbatches
from Kafka'' (cf link in thread)
Who did this?
-John
On Mon, May
Hi John,
We are strongly committed to backwards compatibility in the Arrow format
specification. You should not fear any compatibility-breaking changes
in the future. People sometimes express uncertainty because we have not
reached 1.0 yet, but that's because we have not yet implemented all
Micah, yes, it all works at the moment. How have we staked out that it
will always work in the future as people continue to work on the spec? That
is my concern.
Also, it would be extremely useful if someone opening a file had my nil
rows hidden from them without needing to analyze the
This is already implicit in the spec, because it requires 8-byte alignment
and padding but recommends 64. I'd be OK updating the spec to explicitly
state that buffers might be oversized, but I agree with Wes that I don't
think a format change is warranted.
On Mon, May 13, 2019 at 6:29 AM John
Furthermore, we already have a "custom_metadata" field on Message
where you could indicate that a RecordBatch is underfilled; there's no
need to change the protocol
https://github.com/apache/arrow/blob/master/format/Message.fbs#L98
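(As a rough illustration only: the custom_metadata field Wes points to lives on the
Flatbuffers Message, which is not straightforward to set from Python, so this sketch
carries the same "underfilled batch" hint in schema-level metadata instead; the
utilized_rows key is a made-up convention for the sketch, not part of the spec.)

import pyarrow as pa

# Pre-allocated batch of 5 rows, of which only the first 3 are "real".
batch = pa.RecordBatch.from_arrays([pa.array([1.0, 2.0, 3.0, 0.0, 0.0])],
                                   names=["x"])
batch = batch.replace_schema_metadata({b"utilized_rows": b"3"})

sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, batch.schema) as writer:
    writer.write_batch(batch)

# A cooperating reader honors the hint by slicing off the unused tail.
reader = pa.ipc.open_file(sink.getvalue())
utilized = int(reader.schema.metadata[b"utilized_rows"])
visible = reader.get_batch(0).slice(0, utilized)
print(visible.num_rows)  # 3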
On Mon, May 13, 2019 at 8:30 AM Micah Kornfield wrote:
>
> Hi
Hi John,
To expand on this, I don't think there is anything preventing you in the
current spec from over-provisioning the underlying buffers. So you can
effectively split "capacity" from "length" by comparing the size of the
buffer with the amount of space taken by the rows indicated in the
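(A minimal pyarrow sketch of that over-provisioning idea; the sizes here are
arbitrary. The buffer is allocated larger than the array length it backs:)

import pyarrow as pa

buf = pa.allocate_buffer(512 * 8)    # room for 512 float64 values
length = 100                         # but only 100 rows are claimed
arr = pa.Array.from_buffers(pa.float64(), length, [None, buf])

spare = buf.size - len(arr) * 8      # unused "capacity" behind the array
print(len(arr), buf.size, spare)     # 100 4096 3296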
Thanks Wes, do you have any comment on the following from the zdnet story I
linked?
``But the missing piece is streaming, where the velocity of incoming data
poses a special challenge. There are some early experiments to populate
Arrow nodes in microbatches from Kafka. And, as the edge gets
hi John,
Sorry, there are a number of fairly long e-mails in this thread; I'm
having a hard time following all of the details.
I suspect the most parsimonious thing would be to have some "sidecar"
metadata that tracks the state of your writes into pre-allocated Arrow
blocks so that readers know to
Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow
std::vector terminology)
Thanks,
John
On Thu, May 9, 2019 at 2:46 PM John Muehlhausen wrote:
> Wes et al, I think my core proposal is that Message.fbs:RecordBatch split
> the "length" parameter into "theoretical max
Wes et al, I think my core proposal is that Message.fbs:RecordBatch split
the "length" parameter into "theoretical max length" and "utilized length"
(perhaps not those exact names).
"theoretical max length is the same as "length" now ... /// ...The arrays
in the batch should all have this
On Tue, May 7, 2019 at 12:26 PM John Muehlhausen wrote:
>
> Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already reads
> the future Feather format? If not, how will the future format differ? I
> will work on my access pattern with this format instead of the current
> feather
Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already reads
the future Feather format? If not, how will the future format differ? I
will work on my access pattern with this format instead of the current
feather format. Sorry I was not clear on that earlier.
Micah, thank you!
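(For anyone following along, the call John mentions can be exercised against the
Arrow IPC file format along these lines; the file name and data are made up for
the sketch.)

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": ["x", "y", "z"], "b": [1.0, 2.0, 3.0]})
table = pa.Table.from_pandas(df)

with pa.OSFile("example.arrow", "wb") as sink:           # write the IPC file format
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

print(pa.ipc.open_file("example.arrow").read_pandas())   # read it back as pandas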
On
Hi John,
To give a specific pointer [1] describes how the streaming protocol is
stored to a file.
[1] https://arrow.apache.org/docs/format/IPC.html#file-format
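(A small sketch of the relationship [1] describes, with the same batch written as a
bare stream and as the random-access file format, which wraps the stream with a
footer; the data is arbitrary.)

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])

stream_sink = pa.BufferOutputStream()
with pa.ipc.new_stream(stream_sink, batch.schema) as w:  # bare streaming protocol
    w.write_batch(batch)

file_sink = pa.BufferOutputStream()
with pa.ipc.new_file(file_sink, batch.schema) as w:      # same messages, wrapped
    w.write_batch(batch)                                  # with magic + footer

print(stream_sink.getvalue().size, file_sink.getvalue().size)  # file is larger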
On Tue, May 7, 2019 at 9:40 AM Wes McKinney wrote:
> hi John,
>
> As soon as the R folks can install the Arrow R package consistently,
Thanks Wes:
"the current Feather format is deprecated" ... yes, but there will be a
future file format that replaces it, correct? And my discussion of
immutable "portions" of Arrow buffers, rather than immutability of the
entire buffer, applies to IPC as well, right? I am only championing the
hi John,
On Tue, May 7, 2019 at 10:53 AM John Muehlhausen wrote:
>
> Wes et al, I completed a preliminary study of populating a Feather file
> incrementally. Some notes and questions:
>
> I wrote the following dataframe to a feather file:
>             a    b
> 0  0123456789  0.0
> 1
Wes et al, I completed a preliminary study of populating a Feather file
incrementally. Some notes and questions:
I wrote the following dataframe to a feather file:
            a    b
0  0123456789  0.0
1  0123456789  NaN
2  0123456789  NaN
3  0123456789  NaN
4        None  NaN
In re-writing the
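(For context, a frame like the one above can be produced and round-tripped with the
feather API along these lines; the path is made up, and note Wes's caution below
about relying on the current Feather layout.)

import pandas as pd
from pyarrow import feather

df = pd.DataFrame({"a": ["0123456789"] * 4 + [None],
                   "b": [0.0] + [float("nan")] * 4})
feather.write_feather(df, "example.feather")
print(feather.read_feather("example.feather"))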
hi John -- again, I would caution you against using Feather files for
issues of longevity -- the internal memory layout of those files is a
"dead man walking" so to speak.
I would advise against forking the project, IMHO that is a dark path
that leads nowhere good. We have a large community here
François, Wes,
Thanks for the feedback. I think the most practical thing for me to do is
1- write a Feather file that is structured to pre-allocate the space I need
(e.g. initial variable-length strings are of average size)
2- come up with code to monkey around with the values contained in the
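(One way to read step 1, purely as a sketch with made-up sizes: write placeholder
strings of a fixed "average" width so the offsets and values buffers end up with a
predictable layout that could be patched later, subject to all the immutability
caveats raised elsewhere in this thread.)

import pyarrow as pa

avg_size, n_rows = 10, 5
placeholder = pa.array(["\0" * avg_size] * n_rows, type=pa.string())
table = pa.table({"a": placeholder})

with pa.OSFile("prealloc.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)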
Hello John,
Arrow is not yet suited for partial writes. The specification only talks
about fully frozen/immutable objects; you're in implementation-defined
territory here. For example, the C++ library assumes the Array object is
immutable; it memoizes the null count, and likely more statistics in
hi John,
Feel free to open some JIRA issues to make a specific proposal about
what you want to see in the libraries
I would recommend not coupling yourself to the Feather format as it
stands now, as I would like to change it as soon as > 90% of R users
can successfully install the Arrow
Wes,
I’m not afraid of writing my own C++ code to deal with all of this on the
writer side. I just need a way to “append” (incrementally populate) e.g.
feather files so that a person using e.g. pyarrow doesn’t suffer some
catastrophic failure... and “on the side” I tell them which rows are junk
Thanks Jacques,
Not what I had hoped, but assuming that I have some other mechanism for
telling the reader which rows are junk, it seems like there is a follow-up
question regarding adherence to specification for variable-width strings:
Suppose I have 100 bytes for string storage and a vector of
hi John,
In C++ the builder classes don't yet support writing into preallocated
memory. It would be tricky for applications to determine a priori
which segments of memory to pass to the builder. It seems only
feasible for primitive / fixed-size types so my guess would be that a
separate set of
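(To make the point about fixed-size types concrete, here is a rough pyarrow/numpy
sketch, not the C++ builders themselves: with a fixed-width type every slot has a
known offset, so values can be written into a preallocated buffer after the fact.)

import numpy as np
import pyarrow as pa

buf = pa.allocate_buffer(1024 * 8)              # room for 1024 float64 slots
scratch = np.frombuffer(buf, dtype=np.float64)  # writable view of the buffer

scratch[:3] = [1.5, 2.5, 3.5]                   # fill the first three slots in place

arr = pa.Array.from_buffers(pa.float64(), 3, [None, buf])  # expose the filled prefix
print(arr.to_pylist())                          # [1.5, 2.5, 3.5]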
This is more of a question of implementation versus specification. An Arrow
buffer is generally built and then sealed. In different languages, this
building process works differently (a concern of the language rather than
the memory specification). We don't currently allow a half-built vector to
Hello,
Glad to learn of this project— good work!
If I allocate a single chunk of memory and start building Arrow format
within it, does this chunk save any state regarding my progress?
For example, suppose I allocate a column for floating point (fixed width)
and a column for string (variable