hi John -- again, I would caution you against using Feather files where longevity is a concern -- the internal memory layout of those files is a "dead man walking", so to speak.
I would advise against forking the project -- IMHO that is a dark path that leads nowhere good. We have a large community here and we accept pull requests. I think the challenge is going to be defining the use case with enough clarity that a general-purpose solution can be developed.

- Wes

On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <j...@jgm.org> wrote:
>
> François, Wes,
>
> Thanks for the feedback. I think the most practical thing for me to do is:
> 1- write a Feather file that is structured to pre-allocate the space I need (e.g. initial variable-length strings are of average size)
> 2- come up with code to monkey around with the values contained in the vectors so that before and after each manipulation the file is valid as I walk the rows ... this is a writer that uses memory mapping
> 3- check back in here once that works, assuming that it does, to see if we can bless certain mutation paths
> 4- if we can't bless certain mutation paths, fork the project for those who care more about stream processing? That would not seem ideal, as I think mutation in row order across the data set has relatively little impact on the overall design?
>
> Thanks again for engaging the topic!
> -John
>
> On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
> >
> > Hello John,
> >
> > Arrow is not yet suited for partial writes. The specification only talks about fully frozen/immutable objects, so you're in implementation-defined territory here. For example, the C++ library assumes the Array object is immutable; it memoizes the null count, and will likely memoize more statistics in the future.
> >
> > If you want to use pre-allocated buffers and arrays, you can use the column validity bitmap for this purpose, e.g. set all rows to null by default and flip the bit once the row is written. This suffers from concurrency issues (plus the invalidation issues already pointed out) when dealing with multiple columns. You'll have to use a barrier of some kind, e.g. a per-batch global atomic (if append-only), or dedicated column(s) à la MVCC. But then the reader needs to be aware of this and compute a mask each time it queries the partial batch.
> >
> > This is a common columnar database problem; see [1] for a recent paper on the subject. The usual technique is to store the recent data row-wise and convert it to column-wise once a threshold is met, akin to a compaction phase. There was a somewhat related thread [2] lately about streaming vs. batching. In the end, I think your solution will be very application specific.
> >
> > François
> >
> > [1] https://db.in.tum.de/downloads/publications/datablocks.pdf
> > [2] https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
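As a rough illustration of the validity-bitmap-plus-barrier approach François describes above -- a sketch only, not Arrow library code; the names MappedBatch, publish_row and visible_rows are invented for the example -- a writer into a pre-allocated, memory-mapped batch might publish rows like this:

#include <atomic>
#include <cstdint>

// One pre-allocated batch living in a memory-mapped region. The validity
// bitmap starts all-zero (every row null); the counter starts at zero.
struct MappedBatch {
  uint8_t*              validity;   // column validity bitmap inside the mapping
  std::atomic<int64_t>* published;  // per-batch counter, also inside the mapping
};

// Arrow validity bitmaps are bit-packed, least-significant bit first.
inline void set_valid(uint8_t* bitmap, int64_t i) {
  bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
}

// Writer: fill every column's slot for row i first, then flip the bit and
// bump the counter with release semantics so a reader never sees a
// half-written row.
void publish_row(MappedBatch& b, int64_t i) {
  set_valid(b.validity, i);
  b.published->store(i + 1, std::memory_order_release);
}

// Reader: anything at index >= visible_rows(b) must be masked out.
int64_t visible_rows(const MappedBatch& b) {
  return b.published->load(std::memory_order_acquire);
}

The release/acquire pair on the per-batch counter is the "barrier of some kind" mentioned above: readers trust only the rows below the published count and mask out everything else.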
> > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <j...@jgm.org> wrote:
> > >
> > > Wes,
> > >
> > > I'm not afraid of writing my own C++ code to deal with all of this on the writer side. I just need a way to "append" (incrementally populate) e.g. Feather files so that a person using e.g. pyarrow doesn't suffer some catastrophic failure ... and "on the side" I tell them which rows are junk and deal with any concurrency issues that can't be solved in the arena of atomicity and ordering of ops. For now I care about basic types, but including variable-width strings.
> > >
> > > For event processing, I think Arrow has to have the concept of a partially full record set. Some alternatives are:
> > > - have a batch size of one, thus littering the landscape with trivially small Arrow buffers
> > > - artificially increase latency with a batch size larger than one, not processing any data until a batch is complete
> > > - continuously re-write the (entire!) "main" buffer as batches of length 1 roll in
> > > - instead of one main buffer, keep several, and at some threshold combine the last N length-1 batches into a length-N buffer ... still an inefficiency
> > >
> > > Consider the case of QAbstractTableModel as the underlying data for a table or a chart. This visualization shows all of the data for the recent past as well as events rolling in. If this model interface is implemented as a view onto "many thousands" of individual event buffers, then we gain nothing from the columnar layout. (Suppose there are tons of columns and most of them are scrolled out of the view.) Likewise, we cannot re-write the entire model on each event ... the time complexity blows up. What we want is a large pre-allocated chunk, and to just change rowCount() as data is "appended." Sure, we may run out of space and need another and another chunk for future row ranges, but a handful of chunks chained together is better than as many chunks as there were events!
> > >
> > > And again, having a batch size >1 and delaying the data until a batch is full is a non-starter.
> > >
> > > I am really hoping to see partially-filled buffers as something we keep our finger on moving forward! Or else, what am I missing?
> > >
> > > -John
> > >
> > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > hi John,
> > > >
> > > > In C++ the builder classes don't yet support writing into preallocated memory. It would be tricky for applications to determine a priori which segments of memory to pass to the builder. It seems feasible only for primitive / fixed-size types, so my guess would be that a separate set of interfaces would need to be developed for this task.
> > > >
> > > > - Wes
> > > >
> > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >
> > > > > This is more a question of implementation than of specification. An Arrow buffer is generally built and then sealed. In different languages this building process works differently (a concern of the language rather than of the memory specification). We don't currently allow a half-built vector to be moved to another language and then built further. So the question is really more concrete: what language are you looking at, and what specific pattern are you trying to follow for building?
> > > > >
> > > > > If you're trying to go across independent processes (whether the same process restarted, or two separate processes active simultaneously), you'll need to build up your own data structures to help with this.
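A minimal sketch of the "build up your own data structures" idea Jacques mentions above: a small progress header kept in the same mapped file, ahead of the Arrow-formatted bytes, rather than in a separate side-car file. The layout and field names are invented for the example and are not part of any Arrow specification.

#include <atomic>
#include <cstdint>

// Hypothetical header placed at offset 0 of the pre-allocated, memory-mapped
// file, before the Arrow-formatted buffers.
struct ProgressHeader {
  uint64_t              magic;                // identifies the writer's own format
  uint64_t              capacity_rows;        // rows pre-allocated in the batch
  uint64_t              string_heap_capacity; // bytes reserved for string data
  std::atomic<uint64_t> committed_rows;       // rows fully written and safe to read
  std::atomic<uint64_t> string_heap_used;     // bytes of string data consumed so far
};

// A restarted writer (or a concurrent reader) recovers its position from the
// header, because the Arrow buffers themselves store no such state.
uint64_t resume_row(const ProgressHeader& h) {
  return h.committed_rows.load(std::memory_order_acquire);
}

Whether this header lives inside the mapping or in a side-car is a packaging detail; the point is that the Arrow buffers carry no notion of how full they are, so that state has to live somewhere the writer controls.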
> > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Glad to learn of this project -- good work!
> > > > > >
> > > > > > If I allocate a single chunk of memory and start building Arrow format within it, does this chunk save any state regarding my progress?
> > > > > >
> > > > > > For example, suppose I allocate a column for floating point (fixed width) and a column for strings (variable width). Suppose I start building the floating point column at offset X into my single buffer, the string "pointer" column at offset Y into the same single buffer, and the string data elements at offset Z.
> > > > > >
> > > > > > I write one floating point number and one string, then go away. When I come back to this buffer to append another value, does the buffer itself know where I would begin? I.e. is there a differentiation in the column (or blob) data itself between the available space and the used space?
> > > > > >
> > > > > > Suppose I write a lot of large variable-width strings and "run out" of space for them before running out of space for floating point numbers or string pointers. (I guessed badly when doing the original allocation.) I consider this to be OK, since I can always "copy" the data to "compress out" the unused fp/pointer buckets ... the choice is up to me.
> > > > > >
> > > > > > The above, applied to a (Feather?) file, is how I anticipate appending data to disk ... pre-allocate a mem-mapped file and gradually fill it up. The efficiency of file utilization will depend on my projections regarding variable-width data types, but as I said above, I can always re-write the file if/when this bothers me.
> > > > > >
> > > > > > Is this the recommended and supported approach for incremental appends? I'm really hoping to use Arrow instead of rolling my own, but functionality like this is absolutely key! I'm hoping not to use a side-car file (or memory chunk) to store "append progress" information.
> > > > > >
> > > > > > I am brand new to this project, so please forgive me if I have overlooked something obvious. And again, it looks like great work so far!
> > > > > >
> > > > > > Thanks!
> > > > > > -John
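To make the question above concrete, here is a sketch of the layout John describes: a float64 column at offset X, an int32 string-offsets buffer at offset Y, and the string bytes at offset Z, all inside one pre-allocated region. It assumes offsets[0] was set to 0 when the region was initialized and omits validity bitmaps. The offsets X/Y/Z and the row counter are the caller's own bookkeeping; nothing in the buffers records how many slots are in use, which is exactly the gap being asked about.

#include <cstdint>
#include <cstring>

struct Preallocated {
  uint8_t* base;       // start of the mapped region
  size_t   x, y, z;    // offsets chosen at allocation time
  int64_t  rows_used;  // caller-tracked progress; the buffers store no equivalent
};

// Append one (double, string) row in the Arrow style: fixed-width values are
// densely packed, and the string column is an int32 offsets buffer pointing
// into a contiguous data heap.
void append_row(Preallocated& p, double value, const char* s, int32_t len) {
  const int64_t n = p.rows_used;
  // fixed-width column: slot n is n * sizeof(double) bytes past offset X
  std::memcpy(p.base + p.x + n * sizeof(double), &value, sizeof(double));
  // variable-width column: offsets[n] is the current end of the string heap;
  // offsets[n + 1] records the new end after this value is written
  auto* offsets = reinterpret_cast<int32_t*>(p.base + p.y);
  const int32_t end = offsets[n];
  std::memcpy(p.base + p.z + end, s, static_cast<size_t>(len));
  offsets[n + 1] = end + len;
  p.rows_used = n + 1;  // the only record of progress lives outside the buffers
}

If the variable-width heap fills up before the fixed-width slots do, the remedies are the ones described above: re-write the file, or chain another chunk.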