Thanks Jacques,

Not what I had hoped for, but assuming I have some other mechanism for
telling the reader which rows are junk, I have a follow-up question about
adhering to the specification for variable-width strings:

Suppose I have 100 bytes for string storage and a vector of offsets into
it.  I initialize the offsets for a three-row vector as follows:

0 -> 100
1 -> 100
2 -> 100

This would encode three zero-length strings, correct?  The 100 bytes would
be unused, but hopefully the specification does not care?
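
For what it's worth, I can roughly cross-check that reading of the layout by
hand-building the same buffers in pyarrow (a sketch; note the format also
keeps one trailing offset, so I pad the offsets out to four entries):

    import struct
    import pyarrow as pa

    # 100 unused data bytes, and offsets that all point at the same
    # position, i.e. every slot has length zero.
    data = pa.py_buffer(bytes(100))
    offsets = pa.py_buffer(struct.pack("<4i", 100, 100, 100, 100))
    arr = pa.Array.from_buffers(pa.string(), 3, [None, offsets, data])
    print(arr.to_pylist())   # prints ['', '', '']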

Writing the first string, “hi”, would result in:
0 -> 0
1 -> 2
2 -> 100

The reader would possibly slurp in the entire vector but be told that only
row 0 is valid data.  The 98-byte string in row 1 would be ignored after the
read (e.g. pyarrow reading from a Feather file).
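
To make that concrete, here is roughly what I picture the reader doing with
those buffers, assuming it learns the valid row count out of band (another
pyarrow sketch):

    import struct
    import pyarrow as pa

    # The raw storage after writing "hi": 2 used bytes, 98 junk bytes,
    # plus the offsets 0 -> 0, 1 -> 2, 2 -> 100 from above.
    data = pa.py_buffer(b"hi" + bytes(98))
    offsets = pa.py_buffer(struct.pack("<3i", 0, 2, 100))

    # Only one row is declared valid, so length=1; the junk bytes past
    # offset 2 are never dereferenced.
    arr = pa.Array.from_buffers(pa.string(), 1, [None, offsets, data])
    print(arr.to_pylist())   # prints ['hi']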

So it seems like I only need to update the offset for the row after the row
being written in order to adhere to the spec for variable-width strings
(roughly the sketch below).  Do you agree?
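
Concretely (a toy sketch over a plain bytearray instead of the real
mem-mapped file; int32 offsets, validity bitmap ignored, and offsets assumed
zero-initialized rather than set to 100 as above):

    import struct

    def append_string(buf, offsets_pos, data_pos, row, value):
        # buf: a writable buffer (bytearray here, a writable mmap in practice).
        # offsets_pos / data_pos: byte positions of the offsets region and
        # the string-data region inside the preallocated chunk.
        data = value.encode("utf-8")
        start = struct.unpack_from("<i", buf, offsets_pos + 4 * row)[0]
        end = start + len(data)
        buf[data_pos + start:data_pos + end] = data  # string bytes
        struct.pack_into("<i", buf, offsets_pos + 4 * (row + 1), end)  # next offset only

    buf = bytearray(200)                   # stand-in: offsets at byte 0, data at byte 16
    append_string(buf, 0, 16, 0, "hi")     # writes "hi", sets offset 1 -> 2
    append_string(buf, 0, 16, 1, "world")  # writes "world", sets offset 2 -> 7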

-John


On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:

> This is more of a question of implementation versus specification. An arrow
> buffer is generally built and then sealed. In different languages, this
> building process works differently (a concern of the language rather than
> the memory specification). We don't currently allow a half built vector to
> be moved to another language and then be further built. So the question is
> really more concrete: what language are you looking at and what is the
> specific pattern you're trying to undertake for building.
>
> If you're trying to go across independent processes (whether the same
> process restarted or two separate processes active simultaneously) you'll
> need to build up your own data structures to help with this.
>
> On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:
>
> > Hello,
> >
> > Glad to learn of this project— good work!
> >
> > If I allocate a single chunk of memory and start building Arrow format
> > within it, does this chunk save any state regarding my progress?
> >
> > For example, suppose I allocate a column for floating point (fixed width)
> > and a column for string (variable width).  Suppose I start building the
> > floating point column at offset X into my single buffer, and the string
> > “pointer” column at offset Y into the same single buffer, and the string
> > data elements at offset Z.
> >
> > I write one floating point number and one string, then go away.  When I
> > come back to this buffer to append another value, does the buffer itself
> > know where I would begin?  I.e. is there a differentiation in the column
> > (or blob) data itself between the available space and the used space?
> >
> > Suppose I write a lot of large variable width strings and “run out” of
> > space for them before running out of space for floating point numbers or
> > string pointers.  (I guessed badly when doing the original allocation.)  I
> > consider this to be OK since I can always “copy” the data to “compress out”
> > the unused fp/pointer buckets... the choice is up to me.
> >
> > The above applied to a (feather?) file is how I anticipate appending data
> > to disk... pre-allocate a mem-mapped file and gradually fill it up.  The
> > efficiency of file utilization will depend on my projections regarding
> > variable-width data types, but as I said above, I can always re-write the
> > file if/when this bothers me.
> >
> > Is this the recommended and supported approach for incremental appends?
> > I’m really hoping to use Arrow instead of rolling my own, but functionality
> > like this is absolutely key!  Hoping not to use a side-car file (or memory
> > chunk) to store “append progress” information.
> >
> > I am brand new to this project so please forgive me if I have overlooked
> > something obvious.  And again, looks like great work so far!
> >
> > Thanks!
> > -John
> >
>
