hi John,

Feel free to open some JIRA issues to make a specific proposal about
what you want to see in the libraries.
I would recommend not coupling yourself to the Feather format as it
stands now, as I would like to change it as soon as more than 90% of R
users can successfully install the Arrow libraries (they cannot at
present, so I've been holding off on doing more there).

- Wes

On Mon, May 6, 2019 at 9:39 AM John Muehlhausen <j...@jgm.org> wrote:
>
> Wes,
>
> I'm not afraid of writing my own C++ code to deal with all of this on the
> writer side. I just need a way to "append" (incrementally populate) e.g.
> Feather files so that a person using e.g. pyarrow doesn't suffer some
> catastrophic failure... and "on the side" I tell them which rows are junk
> and deal with any concurrency issues that can't be solved in the arena of
> atomicity and ordering of ops. For now I care about basic types, but
> including variable-width strings.
>
> For event processing, I think Arrow has to have the concept of a partially
> full record set. Some alternatives are:
> - have a batch size of one, thus littering the landscape with trivially
> small Arrow buffers
> - artificially increase latency with a batch size larger than one,
> processing no data until a batch is complete
> - continuously rewrite the (entire!) "main" buffer as batches of length 1
> roll in
> - keep several buffers instead of one main buffer, and at some threshold
> combine the last N length-1 batches into a length-N buffer... still an
> inefficiency
>
> Consider the case of QAbstractTableModel as the underlying data for a
> table or a chart. This visualization shows all of the data for the recent
> past as well as events rolling in. If this model interface is implemented
> as a view onto "many thousands" of individual event buffers, then we gain
> nothing from columnar layout. (Suppose there are tons of columns and most
> of them are scrolled out of the view.) Likewise we cannot rewrite the
> entire model on each event... time complexity blows up. What we want is
> to have a large pre-allocated chunk and just change rowCount() as data is
> "appended." Sure, we may run out of space and add another and another
> chunk for future row ranges, but a handful of chunks chained together is
> better than as many chunks as there were events!
>
> And again, having a batch size >1 and delaying the data until a batch is
> full is a non-starter.
>
> I am really hoping to see partially filled buffers as something we keep
> our finger on moving forward! Or else, what am I missing?
>
> -John
>
> On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi John,
> >
> > In C++ the builder classes don't yet support writing into preallocated
> > memory. It would be tricky for applications to determine a priori
> > which segments of memory to pass to the builder. It seems feasible only
> > for primitive / fixed-size types, so my guess would be that a separate
> > set of interfaces would need to be developed for this task.
> >
> > - Wes
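To make the above concrete: here is a rough sketch of the shape such a
separate interface might take for a fixed-width type. This is
illustrative only -- the class below is not an existing Arrow API -- and
it shows the crux of the problem: the "how many slots are used" state
has to live in the writer, outside the buffer itself.

#include <cstddef>
#include <cstdint>

// Illustrative only -- not an existing Arrow API. An "append into
// preallocated memory" writer for a fixed-width (double) column. The
// caller hands over the memory region; the writer tracks how many
// slots are used, since the buffer itself stores no such state.
class PreallocatedDoubleWriter {
 public:
  PreallocatedDoubleWriter(uint8_t* data, size_t capacity_bytes)
      : values_(reinterpret_cast<double*>(data)),  // data assumed aligned
        capacity_(capacity_bytes / sizeof(double)),
        length_(0) {}

  // Returns false once the preallocated region is full.
  bool Append(double v) {
    if (length_ == capacity_) return false;
    values_[length_++] = v;
    return true;
  }

  // The used-vs-available distinction: capacity() slots exist, but
  // only the first length() hold valid rows.
  size_t length() const { return length_; }
  size_t capacity() const { return capacity_; }

 private:
  double* values_;
  size_t capacity_;
  size_t length_;
};

A reader mapping the same memory would still need length() communicated
"on the side", which is exactly the side-car state John says (in his
original message below) he is hoping to avoid.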
> > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:
> > >
> > > This is more of a question of implementation versus specification. An
> > > Arrow buffer is generally built and then sealed. In different
> > > languages, this building process works differently (a concern of the
> > > language rather than the memory specification). We don't currently
> > > allow a half-built vector to be moved to another language and then be
> > > built further. So the question is really more concrete: what language
> > > are you looking at, and what is the specific pattern you're trying to
> > > undertake for building?
> > >
> > > If you're trying to go across independent processes (whether the same
> > > process restarted or two separate processes active simultaneously),
> > > you'll need to build up your own data structures to help with this.
> > >
> > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > Glad to learn of this project -- good work!
> > > >
> > > > If I allocate a single chunk of memory and start building Arrow
> > > > format within it, does this chunk save any state regarding my
> > > > progress?
> > > >
> > > > For example, suppose I allocate a column for floating point (fixed
> > > > width) and a column for strings (variable width). Suppose I start
> > > > building the floating-point column at offset X into my single
> > > > buffer, the string "pointer" column at offset Y into the same
> > > > buffer, and the string data elements at offset Z.
> > > >
> > > > I write one floating-point number and one string, then go away.
> > > > When I come back to this buffer to append another value, does the
> > > > buffer itself know where I would begin? I.e., is there a
> > > > differentiation in the column (or blob) data itself between the
> > > > available space and the used space?
> > > >
> > > > Suppose I write a lot of large variable-width strings and "run out"
> > > > of space for them before running out of space for floating-point
> > > > numbers or string pointers. (I guessed badly when doing the original
> > > > allocation.) I consider this to be OK, since I can always "copy" the
> > > > data to "compress out" the unused fp/pointer buckets... the choice
> > > > is up to me.
> > > >
> > > > The above, applied to a (Feather?) file, is how I anticipate
> > > > appending data to disk... pre-allocate a mem-mapped file and
> > > > gradually fill it up. The efficiency of file utilization will depend
> > > > on my projections regarding variable-width data types, but as I said
> > > > above, I can always rewrite the file if/when this bothers me.
> > > >
> > > > Is this the recommended and supported approach for incremental
> > > > appends? I'm really hoping to use Arrow instead of rolling my own,
> > > > but functionality like this is absolutely key! I'm hoping not to use
> > > > a side-car file (or memory chunk) to store "append progress"
> > > > information.
> > > >
> > > > I am brand new to this project, so please forgive me if I have
> > > > overlooked something obvious. And again, it looks like great work so
> > > > far!
> > > >
> > > > Thanks!
> > > > -John
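P.S. To sketch the single-chunk layout John describes in his original
message above (floating-point values at offset X, string offsets at Y,
string bytes at Z): the "append progress" has to be tracked alongside
the chunk, because the Arrow layout itself does not distinguish used
from available space. Everything below is illustrative -- the struct
and field names are made up, and none of it is an existing Arrow API.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical bookkeeping for one preallocated region holding:
//   offset x: float64 values        (fixed width)
//   offset y: int32 string offsets  (Arrow-style: n+1 entries for n rows)
//   offset z: string bytes          (variable width)
// x and y are assumed suitably aligned, and the offsets section is
// assumed to be sized for fp_capacity + 1 entries. The counters `rows`
// and `str_used` are the append progress the chunk itself cannot store.
struct SingleChunk {
  uint8_t* base;        // start of the preallocated region
  size_t x, y, z;       // section offsets chosen at allocation time
  size_t fp_capacity;   // max rows the fp and offsets sections can hold
  size_t str_capacity;  // bytes reserved for string data
  size_t rows = 0;      // append progress: valid rows so far
  size_t str_used = 0;  // append progress: string bytes consumed

  bool AppendRow(double value, const std::string& s) {
    if (rows == fp_capacity || str_used + s.size() > str_capacity)
      return false;  // guessed badly at allocation time; caller rewrites
    auto* values = reinterpret_cast<double*>(base + x);
    auto* offsets = reinterpret_cast<int32_t*>(base + y);
    uint8_t* strings = base + z;
    if (rows == 0) offsets[0] = 0;  // leading entry of the offsets array
    values[rows] = value;
    std::memcpy(strings + str_used, s.data(), s.size());
    str_used += s.size();
    offsets[rows + 1] = static_cast<int32_t>(str_used);
    ++rows;
    return true;
  }
};

One option (not part of the format) is to keep `rows` and `str_used` in
a small header within the same mem-mapped file, so a reader can recover
the append progress without a side-car, though that steps outside the
Arrow format proper.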