Note that this would be an upper bound, because buffers can be shared between arrays, so a shared buffer gets counted more than once.
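
If the double counting matters, one option is to remember which buffers have already been seen and count each one only once. Below is a minimal sketch of that idea, not a tested Arrow snippet: `Buffer` and `ArrayData` are hypothetical stand-ins that model only the members used in this thread (`size()` and `buffers`), so that the example compiles without Arrow, and `UniqueBufferBytes` is a name I made up for illustration.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for arrow::Buffer and arrow::ArrayData.
// Only the members used in this thread are modeled.
struct Buffer {
  int64_t size_bytes;
  int64_t size() const { return size_bytes; }
};

struct ArrayData {
  std::vector<std::shared_ptr<Buffer>> buffers;
};

// Sum the buffer sizes of all columns, counting each distinct
// Buffer object (identified by its address) only once.
int64_t UniqueBufferBytes(
    const std::vector<std::shared_ptr<ArrayData>>& columns) {
  std::unordered_set<const Buffer*> seen;
  int64_t total = 0;
  for (const auto& column : columns) {
    for (const auto& buffer : column->buffers) {
      // Entries may be null (e.g. no validity bitmap), so check first.
      // insert().second is true only the first time a pointer is added.
      if (buffer && seen.insert(buffer.get()).second) {
        total += buffer->size();
      }
    }
  }
  return total;
}
```

This only deduplicates identical buffer objects. Two distinct buffer objects that are views on the same underlying memory would still be counted separately, so the result remains an upper bound in that case.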

On Wed, Sep 1, 2021 at 2:15 PM Antoine Pitrou <anto...@python.org> wrote:

> On Tue, 31 Aug 2021 21:46:23 -0700
> Rares Vernica <rvern...@gmail.com> wrote:
> >
> > I'm storing RecordBatch objects in a local cache to improve performance. I
> > want to keep track of the memory usage to stay within bounds. The arrays
> > stored in the batch are not nested.
> >
> > The best way I came up to compute the size of a RecordBatch is:
> >
> >     size_t arrowSize = 0;
> >     for (auto i = 0; i < arrowBatch->num_columns(); ++i) {
> >       auto column = arrowBatch->column_data(i);
> >       if (column->buffers[0])
> >         arrowSize += column->buffers[0]->size();
> >       if (column->buffers[1])
> >         arrowSize += column->buffers[1]->size();
> >     }
> >
> > Does this look reasonable? I guess we are over estimating a bit due to the
> > buffer alignment but that should be fine.
>
> Probably, but you should iterate over all buffers instead of
> selecting just buffers 0 and 1 (what if you have a string column?).
>
> So basically:
>
> ```
> size_t arrowSize = 0;
> for (const auto& column : batch->columns()) {
>   for (const auto& buffer : column->data()->buffers) {
>     if (buffer)
>       arrowSize += buffer->size();
>   }
> }
> ```
>
> Regards
>
> Antoine.