hi John,

As soon as the R folks can install the Arrow R package consistently,
the intent is to replace the Feather internals with the plain Arrow
IPC protocol where we have much better platform support all around.

If you'd like to experiment with creating an API for pre-allocating
fixed-size Arrow protocol blocks and then mutating the data and
metadata on disk in-place, please be our guest. We don't have the
tools developed yet to do this for you.
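
(If it helps as a starting point: a minimal sketch of the
pre-allocation half using pyarrow, with the path and size made up;
the in-place mutation tooling is the part that does not exist yet.)

    import pyarrow as pa

    schema = pa.schema([('a', pa.float64())])

    # Pre-allocate a fixed-size, zero-filled, memory-mapped file.
    sink = pa.create_memory_map('/tmp/prealloc.arrows', 1 << 20)

    # Write stream-format batches into the reserved region; the
    # remainder of the mapping stays zero-filled for later use.
    writer = pa.RecordBatchStreamWriter(sink, schema)
    writer.write_batch(
        pa.RecordBatch.from_arrays([pa.array([1.0, 2.0])], ['a']))
    writer.close()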

- Wes

On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org> wrote:
>
> Thanks Wes:
>
> "the current Feather format is deprecated" ... yes, but there will be a
> future file format that replaces it, correct?  And my discussion of
> immutable "portions" of Arrow buffers, rather than immutability of the
> entire buffer, applies to IPC as well, right?  I am only championing the
> idea that this future file format have the convenience that certain
> preallocated rows can be ignored based on a metadata setting.
>
> "I recommend batching your writes" ... this is extremely inefficient and
> adds unacceptable latency, relative to the proposed solution.  Do you
> disagree?  Either I have a batch length of 1 to minimize latency, which
> eliminates columnar advantages on the read side, or else I add latency.
> Neither works, and it seems that a viable alternative is within sight?
>
> If you don't agree that there is a core issue and opportunity here, I'm not
> sure how to better make my case....
>
> -John
>
> On Tue, May 7, 2019 at 11:02 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi John,
> >
> > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <j...@jgm.org> wrote:
> > >
> > > Wes et al, I completed a preliminary study of populating a Feather file
> > > incrementally.  Some notes and questions:
> > >
> > > I wrote the following dataframe to a feather file:
> > >             a    b
> > > 0  0123456789  0.0
> > > 1  0123456789  NaN
> > > 2  0123456789  NaN
> > > 3  0123456789  NaN
> > > 4        None  NaN
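> > >
> > > (For reproducibility, roughly how that file was produced -- a
> > > sketch, with the file name assumed:)
> > >
> > >     import pandas as pd
> > >     import pyarrow.feather as feather
> > >
> > >     df = pd.DataFrame({
> > >         'a': ['0123456789'] * 4 + [None],  # a null, so 'a' gets a bitfield
> > >         'b': [0.0] + [None] * 4,           # NaNs, so 'b' gets one too
> > >     })
> > >     feather.write_feather(df, 'test.feather')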
> > >
> > > In re-writing the flatbuffers metadata (flatc -p doesn't
> > > support --gen-mutable! yuck! C++ to the rescue...), it seems that
> > > read_feather is not affected by NumRows?  It seems to be driven
> > > entirely by the per-column Length values?
> > >
> > > Also, it seems as if one of the primary usages of NullCount is to
> > > determine whether or not a bitfield is present?  In the
> > > initialization data above I was careful to have a null value in each
> > > column in order to generate a bitfield.
> >
> > Per my prior e-mails, the current Feather format is deprecated, so I'm
> > only willing to engage on a discussion of the official Arrow binary
> > protocol that we use for IPC (memory mapping) and RPC (Flight).
> >
> > >
> > > I then wiped the bitfields in the file and set all of the string
> > > indices to one past the end of the blob buffer (all strings empty):
> > >       a   b
> > > 0  None NaN
> > > 1  None NaN
> > > 2  None NaN
> > > 3  None NaN
> > > 4  None NaN
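> > >
> > > (For anyone following along, the offsets convention -- a sketch
> > > inspecting the raw buffers with pyarrow:)
> > >
> > >     import pyarrow as pa
> > >
> > >     arr = pa.array(['', '', None])
> > >     validity, offsets, blob = arr.buffers()
> > >     # Equal adjacent offsets mean an empty string; null vs. empty is
> > >     # decided by the validity bitmap, not by the offsets.
> > >     print(offsets.to_pybytes())  # four int32 zeros: offsets [0, 0, 0, 0]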
> > >
> > > I then set the first record to some data by consuming some of the string
> > > blob and row 0 and 1 indices, also setting the double:
> > >
> > >                a    b
> > > 0  Hello, world!  5.0
> > > 1           None  NaN
> > > 2           None  NaN
> > > 3           None  NaN
> > > 4           None  NaN
> > >
> > > As mentioned above, NumRows seems to be ignored.  I tried adjusting
> > > each column Length to mask off higher rows and it worked for 4 (hide
> > > last row) but not for less than 4.  I take this to be due to math
> > > used to figure out where the buffers are relative to one another,
> > > since there is only one metadata offset for all of: the (optional)
> > > bitset, index column and (if string) blobs.
> > >
> > > Populating subsequent rows would proceed in a similar way until all
> > > of the blob storage has been consumed, which may come before the
> > > pre-allocated rows have been consumed.
> > >
> > > So what does this mean for my desire to incrementally write these
> > > (potentially memory-mapped) pre-allocated files and/or Arrow buffers in
> > > memory?  Some thoughts:
> > >
> > > - If a single value (such as NumRows) were consulted to determine
> > > the table row dimension, then updating this single value would serve
> > > to tell a reader which rows are relevant.  Assuming this value is
> > > updated after all other mutations are complete, and assuming that
> > > mutations are only appends (addition of rows), then concurrency
> > > control involves only ensuring that this value is not examined while
> > > it is being written.  (See the sketch after these notes.)
> > >
> > > - NullCount presents a concurrency problem if someone reads the file
> > > after this field has been updated, but before NumRows has exposed the
> > > new record (or vice versa).  The idea previously mentioned that there
> > > will "likely [be] more statistics in the future" feels like it might
> > > be scope creep to me?  This is a data representation, not a
> > > calculation framework?  If NullCount had its genesis in the optional
> > > nature of the bitfield, I would suggest that perhaps NullCount can be
> > > dropped in favor of always supplying the bitfield, which in any event
> > > is already contemplated by the spec: "Implementations may choose to
> > > always allocate one anyway as a matter of convenience."  If the
> > > concern is space savings, Arrow is already an extremely uncompressed
> > > format.  (Compression is something I would also consider to be scope
> > > creep as regards Feather... compressed filesystems can be employed
> > > and there are other compressed dataframe formats.)  However, if
> > > protecting the 4 bytes required to update NumRows turns out to be no
> > > easier than protecting all of the statistical bytes as well, as part
> > > of the same "critical section" (locks: yuck!!), then statistics pose
> > > no issue.  I have a feeling that the availability of an atomic write
> > > of 4 bytes will depend on the storage mechanism... memory vs memory
> > > map vs write() etc.
> > >
> > > - The elephant in the room appears to be the presumptive binary
> > > yes/no on mutability of Arrow buffers.  Perhaps the thought is that
> > > certain batch processes will be wrecked if anyone anywhere is
> > > mutating buffers in any way.  But keep in mind I am not proposing
> > > general mutability, only appending of new data.  *A huge amount of
> > > batch processing that will take place with Arrow is on time-series
> > > data (whether financial or otherwise).  It is only natural that
> > > architects will want the minimal impedance mismatch when it comes
> > > time to grow those time series as the events occur going forward.*
> > > So rather than say that I want "mutable" Arrow buffers, I would
> > > pitch this as a call for "immutable populated areas" of Arrow
> > > buffers combined with the concept that the populated area can grow
> > > up to whatever was preallocated.  This will not affect anyone who
> > > has "memoized" a dimension and wants to continue to consider the
> > > then-current data as immutable... it will be immutable and will
> > > always be immutable according to that then-current dimension.
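> > >
> > > (The sketch referenced above: publish-the-count-last write ordering
> > > over a memory map, in Python; the field offset is hypothetical, and
> > > whether the 4-byte store is actually atomic depends on the storage
> > > mechanism, per the note above:)
> > >
> > >     import struct
> > >
> > >     ROW_COUNT_OFFSET = 0  # hypothetical location of the NumRows field
> > >
> > >     def append_row(mm, write_row_data):
> > >         n = struct.unpack_from('<i', mm, ROW_COUNT_OFFSET)[0]
> > >         write_row_data(n)  # 1. fill buffers for row n; readers ignore it
> > >         # 2. publish: expose the row only after its data is complete
> > >         struct.pack_into('<i', mm, ROW_COUNT_OFFSET, n + 1)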
> > >
> > > Thanks in advance for considering this feedback!  I absolutely
> > > require efficient row-wise growth of an Arrow-like buffer to deal
> > > with time series data in near real time.  I believe that
> > > preallocation is (by far) the most efficient way to accomplish this.
> > > I hope to be able to use Arrow!  If I cannot use Arrow then I will be
> > > using a home-grown Arrow that is identical except for this feature,
> > > which would be very sad!  Even if Arrow itself could be used in this
> > > manner today, I would be hesitant to use it if the use-case was not
> > > protected on a go-forward basis.
> > >
> >
> > I recommend batching your writes and using the Arrow binary streaming
> > protocol so you are only appending to a file rather than mutating
> > previously-written bytes. This use case is well defined and supported
> > in the software already.
> >
> >
> > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format
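> >
> > (A sketch of that pattern -- one long-lived stream writer appending
> > batches as they arrive; incoming_batches() is a hypothetical event
> > source:)
> >
> >     import pyarrow as pa
> >
> >     schema = pa.schema([('a', pa.string()), ('b', pa.float64())])
> >
> >     with pa.OSFile('events.arrows', 'wb') as sink:
> >         writer = pa.RecordBatchStreamWriter(sink, schema)
> >         for batch in incoming_batches():  # hypothetical source
> >             writer.write_batch(batch)     # append-only; no mutation
> >         writer.close()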
> >
> > - Wes
> >
> > > Of course, I am completely open to alternative ideas and approaches!
> > >
> > > -John
> > >
> > > On Mon, May 6, 2019 at 11:39 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > hi John -- again, I would caution you against relying on Feather
> > > > files for longevity -- the internal memory layout of those files is
> > > > a "dead man walking" so to speak.
> > > >
> > > > I would advise against forking the project, IMHO that is a dark path
> > > > that leads nowhere good. We have a large community here and we accept
> > > > pull requests -- I think the challenge is going to be defining the use
> > > > case to suitable clarity that a general purpose solution can be
> > > > developed.
> > > >
> > > > - Wes
> > > >
> > > >
> > > > On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <j...@jgm.org> wrote:
> > > > >
> > > > > François, Wes,
> > > > >
> > > > > Thanks for the feedback.  I think the most practical thing for me
> > > > > to do is
> > > > > 1- write a Feather file that is structured to pre-allocate the
> > > > > space I need (e.g. initial variable-length strings are of average
> > > > > size)
> > > > > 2- come up with code to monkey around with the values contained in
> > > > > the vectors so that before and after each manipulation the file is
> > > > > valid as I walk the rows ... this is a writer that uses memory
> > > > > mapping
> > > > > 3- check back in here once that works, assuming that it does, to
> > > > > see if we can bless certain mutation paths
> > > > > 4- if we can't bless certain mutation paths, fork the project for
> > > > > those who care more about stream processing?  That would not seem
> > > > > to be ideal as I think mutation in row-order across the data set is
> > > > > relatively low impact on the overall design?
> > > > >
> > > > > Thanks again for engaging the topic!
> > > > > -John
> > > > >
> > > > > On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques <
> > > > > fsaintjacq...@gmail.com> wrote:
> > > > >
> > > > > > Hello John,
> > > > > >
> > > > > > Arrow is not yet suited for partial writes. The specification
> > > > > > only talks about fully frozen/immutable objects; you're in
> > > > > > implementation-defined territory here. For example, the C++
> > > > > > library assumes the Array object is immutable; it memoizes the
> > > > > > null count, and will likely memoize more statistics in the
> > > > > > future.
> > > > > >
> > > > > > If you want to use pre-allocated buffers and arrays, you can use
> > > > > > the column validity bitmap for this purpose, e.g. set all null by
> > > > > > default and flip once the row is written. It suffers from
> > > > > > concurrency issues (+ invalidation issues as pointed out) when
> > > > > > dealing with multiple columns. You'll have to use a barrier of
> > > > > > some kind, e.g. a per-batch global atomic (if append-only), or
> > > > > > dedicated column(s) à-la MVCC. But then, the reader needs to be
> > > > > > aware of this and compute a mask each time it needs to query the
> > > > > > partial batch.
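> > > > > >
> > > > > > (A sketch of the bitmap flip, assuming Arrow's LSB bit order and
> > > > > > a writable validity buffer:)
> > > > > >
> > > > > >     def mark_valid(bitmap, i):
> > > > > >         # Validity bitmaps are LSB-ordered: bit i lives in byte
> > > > > >         # i // 8 at position i % 8; 1 = valid, 0 = null.
> > > > > >         bitmap[i // 8] |= 1 << (i % 8)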
> > > > > >
> > > > > > This is a common columnar database problem; see [1] for a recent
> > > > > > paper on the subject. The usual technique is to store the recent
> > > > > > data row-wise, and transform it into column-wise form when a
> > > > > > threshold is met, akin to a compaction phase. There was a
> > > > > > somewhat related thread [2] lately about streaming vs batching.
> > > > > > In the end, I think your solution will be very application
> > > > > > specific.
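> > > > > >
> > > > > > (A sketch of that row-wise-then-compact pattern; the threshold
> > > > > > and column names are assumptions:)
> > > > > >
> > > > > >     import pyarrow as pa
> > > > > >
> > > > > >     THRESHOLD = 1024
> > > > > >     pending = []  # recent events, stored row-wise
> > > > > >
> > > > > >     def append(row, batches):
> > > > > >         pending.append(row)
> > > > > >         if len(pending) >= THRESHOLD:
> > > > > >             # compaction: transpose the row-wise tail to columnar
> > > > > >             cols = zip(*pending)
> > > > > >             batches.append(pa.RecordBatch.from_arrays(
> > > > > >                 [pa.array(list(c)) for c in cols], ['a', 'b']))
> > > > > >             pending.clear()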
> > > > > >
> > > > > > François
> > > > > >
> > > > > > [1] https://db.in.tum.de/downloads/publications/datablocks.pdf
> > > > > > [2]
> > > > > > https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
> > > > > >
> > > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <j...@jgm.org> wrote:
> > > > > > >
> > > > > > > Wes,
> > > > > > >
> > > > > > > I’m not afraid of writing my own C++ code to deal with all of
> > > > > > > this on the writer side.  I just need a way to “append”
> > > > > > > (incrementally populate) e.g. feather files so that a person
> > > > > > > using e.g. pyarrow doesn’t suffer some catastrophic failure...
> > > > > > > and “on the side” I tell them which rows are junk and deal
> > > > > > > with any concurrency issues that can’t be solved in the arena
> > > > > > > of atomicity and ordering of ops.  For now I care about basic
> > > > > > > types but including variable-width strings.
> > > > > > >
> > > > > > > For event-processing, I think Arrow has to have the concept
> > > > > > > of a partially full record set.  Some alternatives are:
> > > > > > > - have a batch size of one, thus littering the landscape with
> > > > > > > trivially small Arrow buffers
> > > > > > > - artificially increase latency with a batch size larger than
> > > > > > > one, but not processing any data until a batch is complete
> > > > > > > - continuously re-write the (entire!) “main” buffer as batches
> > > > > > > of length 1 roll in
> > > > > > > - instead of one main buffer, several, and at some threshold
> > > > > > > combine the last N length-1 batches into a length N buffer ...
> > > > > > > still an inefficiency (see the sketch below)
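> > > > > > >
> > > > > > > (Sketch of that last combine step, assuming the length-1
> > > > > > > batches share a schema:)
> > > > > > >
> > > > > > >     import pyarrow as pa
> > > > > > >
> > > > > > >     def compact(batches):
> > > > > > >         # merge many small batches into one contiguous batch
> > > > > > >         table = pa.Table.from_batches(batches)
> > > > > > >         return table.combine_chunks().to_batches()[0]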
> > > > > > >
> > > > > > > Consider the case of QAbstractTableModel as the underlying
> > > > > > > data for a table or a chart.  This visualization shows all of
> > > > > > > the data for the recent past as well as events rolling in.  If
> > > > > > > this model interface is implemented as a view onto “many
> > > > > > > thousands” of individual event buffers then we gain nothing
> > > > > > > from columnar layout.  (Suppose there are tons of columns and
> > > > > > > most of them are scrolled out of the view.)  Likewise we
> > > > > > > cannot re-write the entire model on each event... time
> > > > > > > complexity blows up.  What we want is to have a large
> > > > > > > pre-allocated chunk and just change rowCount() as data is
> > > > > > > “appended.”  Sure, we may run out of space and have another
> > > > > > > and another chunk for future row ranges, but a handful of
> > > > > > > chunks chained together is better than as many chunks as there
> > > > > > > were events!
> > > > > > >
> > > > > > > And again, having a batch size >1 and delaying the data until
> > > > > > > a batch is full is a non-starter.
> > > > > > >
> > > > > > > I am really hoping to see partially-filled buffers as
> > > > > > > something we keep our finger on moving forward!  Or else, what
> > > > > > > am I missing?
> > > > > > >
> > > > > > > -John
> > > > > > >
> > > > > > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > >
> > > > > > > > hi John,
> > > > > > > >
> > > > > > > > In C++ the builder classes don't yet support writing into
> > > > > > > > preallocated memory. It would be tricky for applications to
> > > > > > > > determine a priori which segments of memory to pass to the
> > > > > > > > builder. It seems only feasible for primitive / fixed-size
> > > > > > > > types so my guess would be that a separate set of interfaces
> > > > > > > > would need to be developed for this task.
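> > > > > > > >
> > > > > > > > (The closest thing today, as a sketch: serializing an
> > > > > > > > already-built batch into a caller-owned, pre-allocated
> > > > > > > > buffer, assuming pyarrow's FixedSizeBufferWriter; the size
> > > > > > > > is made up:)
> > > > > > > >
> > > > > > > >     import pyarrow as pa
> > > > > > > >
> > > > > > > >     batch = pa.RecordBatch.from_arrays(
> > > > > > > >         [pa.array([1.0, 2.0])], ['a'])
> > > > > > > >     buf = pa.allocate_buffer(1 << 16)     # pre-allocated region
> > > > > > > >     sink = pa.FixedSizeBufferWriter(buf)  # writes in place
> > > > > > > >     writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> > > > > > > >     writer.write_batch(batch)
> > > > > > > >     writer.close()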
> > > > > > > >
> > > > > > > > - Wes
> > > > > > > >
> > > > > > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > This is more of a question of implementation versus
> > > > > > > > > specification. An Arrow buffer is generally built and then
> > > > > > > > > sealed. In different languages, this building process works
> > > > > > > > > differently (a concern of the language rather than the
> > > > > > > > > memory specification). We don't currently allow a
> > > > > > > > > half-built vector to be moved to another language and then
> > > > > > > > > be further built. So the question is really more concrete:
> > > > > > > > > what language are you looking at and what is the specific
> > > > > > > > > pattern you're trying to undertake for building.
> > > > > > > > >
> > > > > > > > > If you're trying to go across independent processes
> > > > > > > > > (whether the same process restarted or two separate
> > > > > > > > > processes active simultaneously) you'll need to build up
> > > > > > > > > your own data structures to help with this.
> > > > > > > > >
> > > > > > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:
> > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > Glad to learn of this project— good work!
> > > > > > > > > >
> > > > > > > > > > If I allocate a single chunk of memory and start
> > > > > > > > > > building Arrow format within it, does this chunk save
> > > > > > > > > > any state regarding my progress?
> > > > > > > > > >
> > > > > > > > > > For example, suppose I allocate a column for floating
> > > > > > > > > > point (fixed width) and a column for string (variable
> > > > > > > > > > width).  Suppose I start building the floating point
> > > > > > > > > > column at offset X into my single buffer, and the string
> > > > > > > > > > “pointer” column at offset Y into the same single
> > > > > > > > > > buffer, and the string data elements at offset Z.
> > > > > > > > > >
> > > > > > > > > > I write one floating point number and one string, then
> > > > > > > > > > go away.  When I come back to this buffer to append
> > > > > > > > > > another value, does the buffer itself know where I would
> > > > > > > > > > begin?  I.e. is there a differentiation in the column
> > > > > > > > > > (or blob) data itself between the available space and
> > > > > > > > > > the used space?
> > > > > > > > > >
> > > > > > > > > > Suppose I write a lot of large variable width strings
> > > > > > > > > > and “run out” of space for them before running out of
> > > > > > > > > > space for floating point numbers or string pointers.  (I
> > > > > > > > > > guessed badly when doing the original allocation.)  I
> > > > > > > > > > consider this to be Ok since I can always “copy” the
> > > > > > > > > > data to “compress out” the unused fp/pointer buckets...
> > > > > > > > > > the choice is up to me.
> > > > > > > > > >
> > > > > > > > > > The above applied to a (feather?) file is how I
> > > > > > > > > > anticipate appending data to disk... pre-allocate a
> > > > > > > > > > mem-mapped file and gradually fill it up.  The
> > > > > > > > > > efficiency of file utilization will depend on my
> > > > > > > > > > projections regarding variable-width data types, but as
> > > > > > > > > > I said above, I can always re-write the file if/when
> > > > > > > > > > this bothers me.
> > > > > > > > > >
> > > > > > > > > > Is this the recommended and supported approach for
> > > > > > > > > > incremental appends?  I’m really hoping to use Arrow
> > > > > > > > > > instead of rolling my own, but functionality like this
> > > > > > > > > > is absolutely key!  Hoping not to use a side-car file
> > > > > > > > > > (or memory chunk) to store “append progress”
> > > > > > > > > > information.
> > > > > > > > > >
> > > > > > > > > > I am brand new to this project so please forgive me if
> > > > > > > > > > I have overlooked something obvious.  And again, looks
> > > > > > > > > > like great work so far!
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > > -John
> > > > > > > > > >