Thanks Jacques,

Not what I had hoped, but assuming that I have some other mechanism for
telling the reader which rows are junk, it seems like there is a
follow-up question regarding adherence to specification for
variable-width strings:
Suppose I have 100 bytes for string storage and a vector of offsets into
it. I initialize a three-row vector as follows:

  0 -> 100
  1 -> 100
  2 -> 100

This would encode three length zero strings, correct? The 100 bytes would
be unused, but the specification hopefully does not care?

Writing the first string, “hi”, would result in:

  0 -> 0
  1 -> 2
  2 -> 100

The reader would possibly slurp in the entire vector, but be told that
only row 0 is valid data. The string in row 1 of length 98 would be
ignored after read (e.g. pyarrow reading from feather file).

So it seems like I only need to update the row after the row being
written in order to adhere to spec for variable-length strings. Do you
agree?

-John

On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:

> This is more of a question of implementation versus specification. An
> arrow buffer is generally built and then sealed. In different
> languages, this building process works differently (a concern of the
> language rather than the memory specification). We don't currently
> allow a half built vector to be moved to another language and then be
> further built. So the question is really more concrete: what language
> are you looking at and what is the specific pattern you're trying to
> undertake for building.
>
> If you're trying to go across independent processes (whether the same
> process restarted or two separate processes active simultaneously)
> you'll need to build up your own data structures to help with this.
>
> On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:
>
> > Hello,
> >
> > Glad to learn of this project - good work!
> >
> > If I allocate a single chunk of memory and start building Arrow
> > format within it, does this chunk save any state regarding my
> > progress?
> >
> > For example, suppose I allocate a column for floating point (fixed
> > width) and a column for string (variable width). Suppose I start
> > building the floating point column at offset X into my single
> > buffer, and the string “pointer” column at offset Y into the same
> > single buffer, and the string data elements at offset Z.
> >
> > I write one floating point number and one string, then go away.
> > When I come back to this buffer to append another value, does the
> > buffer itself know where I would begin? I.e. is there a
> > differentiation in the column (or blob) data itself between the
> > available space and the used space?
> >
> > Suppose I write a lot of large variable width strings and “run out”
> > of space for them before running out of space for floating point
> > numbers or string pointers. (I guessed badly when doing the original
> > allocation.) I consider this to be Ok since I can always “copy” the
> > data to “compress out” the unused fp/pointer buckets... the choice
> > is up to me.
> >
> > The above applied to a (feather?) file is how I anticipate appending
> > data to disk... pre-allocate a mem-mapped file and gradually fill it
> > up. The efficiency of file utilization will depend on my projections
> > regarding variable-width data types, but as I said above, I can
> > always re-write the file if/when this bothers me.
> >
> > Is this the recommended and supported approach for incremental
> > appends? I’m really hoping to use Arrow instead of rolling my own,
> > but functionality like this is absolutely key! Hoping not to use a
> > side-car file (or memory chunk) to store “append progress”
> > information.
> >
> > I am brand new to this project so please forgive me if I have
> > overlooked something obvious. And again, looks like great work so
> > far!
> >
> > Thanks!
> > -John
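
For reference, the offsets example at the top of this message can be traced with a minimal, pure-Python sketch. It does not use pyarrow; the names `new_offsets`, `slot_lengths`, `read_slot` and `write_row` are hypothetical helpers invented for illustration, and the only assumption is that slot i spans offsets[i] up to offsets[i + 1]. Under that reading, three offsets delimit two slots, which matches the length 98 quoted for row 1 above.

```python
CAPACITY = 100  # bytes reserved for string data, as in the message


def new_offsets(count, capacity=CAPACITY):
    """Offsets that all point at the end of the data buffer (every slot empty)."""
    return [capacity] * count


def slot_lengths(offsets):
    """Length of each slot; slot i spans offsets[i] up to offsets[i + 1]."""
    return [end - start for start, end in zip(offsets, offsets[1:])]


def read_slot(offsets, data, row):
    """Bytes stored in one slot."""
    return bytes(data[offsets[row]:offsets[row + 1]])


def write_row(offsets, data, row, text, capacity=CAPACITY):
    """Hypothetical append: store text in `row`, then fix up the next offset."""
    encoded = text.encode("utf-8")
    start = 0 if row == 0 else offsets[row]
    end = start + len(encoded)
    if end > capacity:
        raise ValueError("not enough string storage left")
    data[start:end] = encoded
    offsets[row] = start      # only changes anything for row 0
    offsets[row + 1] = end    # the "row after the row being written"


offsets = new_offsets(3)          # 0 -> 100, 1 -> 100, 2 -> 100
data = bytearray(CAPACITY)
before = slot_lengths(offsets)    # [0, 0]
write_row(offsets, data, 0, "hi")
after = slot_lengths(offsets)     # [2, 98]
print(before, offsets, after)
```

After the write, the offsets read 0 -> 0, 1 -> 2, 2 -> 100, so the unwritten slot still covers the remaining 98 bytes and the reader needs some other signal to know it is junk.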