Note that this would be an upper bound, because buffers can be shared
between arrays.

On Wed, Sep 1, 2021 at 2:15 PM Antoine Pitrou <anto...@python.org> wrote:

> On Tue, 31 Aug 2021 21:46:23 -0700
> Rares Vernica <rvern...@gmail.com> wrote:
> >
> > I'm storing RecordBatch objects in a local cache to improve performance. I
> > want to keep track of the memory usage to stay within bounds. The arrays
> > stored in the batch are not nested.
> >
> > The best way I came up with to compute the size of a RecordBatch is:
> >
> >             size_t arrowSize = 0;
> >             for (auto i = 0; i < arrowBatch->num_columns(); ++i) {
> >                 auto column = arrowBatch->column_data(i);
> >                 if (column->buffers[0])
> >                     arrowSize += column->buffers[0]->size();
> >                 if (column->buffers[1])
> >                     arrowSize += column->buffers[1]->size();
> >             }
> >
> > Does this look reasonable? I guess we are overestimating a bit due to the
> > buffer alignment but that should be fine.
>
> Probably, but you should iterate over all buffers instead of
> selecting just buffers 0 and 1 (what if you have a string column?).
>
> So basically:
>
> ```
> size_t arrowSize = 0;
> for (const auto& column : batch->columns()) {
>   for (const auto& buffer : column->data()->buffers) {
>     if (buffer)
>       arrowSize += buffer->size();
>   }
> }
> ```
>
> Regards
>
> Antoine.
>
>
>
