Passing len to refer to bytes goes against convention, so you confused me
there, but I understand what you mean better now.

If your goal is (1) bounded memory allocation and (2) good utilization of
the allocated memory, this API from the builders wouldn't help you.

There are cases where the largest buffer of an array builder still has
enough space, but appending a new item doubles the size of another buffer
when all you needed was 4 extra bytes.

Example: there is enough space in the data buffer of a StringBuilder, but
the offsets buffer is full. Append would trigger the geometric growth of
the offsets buffer.
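
To make the arithmetic concrete, here is a toy model in Python. It is not
the Arrow implementation: the class name, the 4-byte offsets and the plain
doubling policy are assumptions made only for illustration.

```python
class ToyStringBuilder:
    """Toy model of a string builder with two independently growing buffers.

    Hypothetical: not Arrow's builder. Assumes 32-bit offsets and a
    simple "double until it fits" growth policy.
    """

    OFFSET_WIDTH = 4  # bytes per offset (assumed)

    def __init__(self, data_capacity, offsets_capacity):
        self.data_capacity = data_capacity        # bytes allocated for string data
        self.offsets_capacity = offsets_capacity  # bytes allocated for offsets
        self.data_size = 0
        self.offsets_size = self.OFFSET_WIDTH     # the leading 0 offset

    @staticmethod
    def _grow(capacity, needed):
        # Geometric growth: double the capacity until the request fits.
        capacity = max(capacity, 1)
        while capacity < needed:
            capacity *= 2
        return capacity

    def append(self, value: bytes):
        # Each buffer grows on its own, regardless of space left in the other.
        self.data_capacity = self._grow(
            self.data_capacity, self.data_size + len(value))
        self.offsets_capacity = self._grow(
            self.offsets_capacity, self.offsets_size + self.OFFSET_WIDTH)
        self.data_size += len(value)
        self.offsets_size += self.OFFSET_WIDTH


# 1 MiB reserved for data, but the offsets buffer only has room for 3 strings.
builder = ToyStringBuilder(data_capacity=1 << 20, offsets_capacity=16)
for s in (b"a", b"b", b"c"):
    builder.append(s)
before = (builder.data_capacity, builder.offsets_capacity)

builder.append(b"d")  # needs 1 byte of data and 4 bytes of offsets
after = (builder.data_capacity, builder.offsets_capacity)
```

In this model the data buffer is almost empty both before and after the
last append, yet the offsets buffer still doubles.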

The builders are intended for use when you want them to grow to
accommodate the appends. The best way to *invert* that is by calling
`Reserve` upfront and making sure you don't append more than what you
reserved.
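
A minimal sketch of that discipline, again as a toy model in Python
rather than the Arrow API. The function name and both budget parameters
are hypothetical; the idea is to fix a per-chunk budget upfront (what you
would have reserved) and end the chunk before an append would exceed it.
A single value larger than the budget still gets a chunk of its own, so
the limit is only a soft constraint.

```python
def chunk_strings(values, max_items, max_data_bytes):
    """Split values into chunks that stay within a fixed, pre-reserved budget.

    Hypothetical helper: max_items models the reserved number of slots and
    max_data_bytes models the reserved data bytes for one chunk.
    """
    chunks, current, used = [], [], 0
    for v in values:
        fits = len(current) < max_items and used + len(v) <= max_data_bytes
        if current and not fits:
            # The reservation is exhausted: finish this chunk, start a new one.
            chunks.append(current)
            current, used = [], 0
        # An oversized value lands alone in a fresh chunk (soft constraint).
        current.append(v)
        used += len(v)
    if current:
        chunks.append(current)
    return chunks


values = [b"aaaa", b"bb", b"c" * 10, b"d", b"e", b"f"]
chunks = chunk_strings(values, max_items=2, max_data_bytes=6)
```

With the real builders, the budget check would sit in front of each
append, and the reservation itself would be made once per chunk.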

--
Felipe

On Fri, Jul 5, 2024 at 7:20 PM Eric Jacobs <[email protected]> wrote:

> Yup, I'm with you. In the code snippet I shared, you'll notice I had a
> `len` argument passed in which is counted against the available space
> and which functions as a conservative estimate of how much buffer space
> this element could take up.
>
> In other words, it's saying that no BufferBuilder would receive more
> than `len` bytes as a result of the forthcoming operation. It seems
> possible to determine such an upper limit - just getting at the
> BufferBuilders is what I don't have.
>
> >
> > StringViewArray (a recent addition [1]) allows a more flexible
> > chunking of the data buffers [2].
>
> Thanks! I'll check it out.
>
> -Eric
>
>
> Felipe Oliveira Carvalho wrote:
> > > However, I'm not seeing how it would be necessary on every append
> > > since the topology wouldn't be changing during the build of a single
> > > chunk (correct me if I'm wrong.)
> >
> > A StringArray, for example, stores all the strings in a single buffer.
> > One after the other. So after every append, the size of the data
> > buffer can go anywhere.
> >
> > If you say you're going to append `len` strings, they could all be
> > empty (buffer grows by 0 bytes) or something like 1 Mb each (buffer
> > grows by len * 1Mb). Similar problems with ListArray which store all
> > the elements from the lists on the same child array. If that child
> > array is a string array, you're now 2 orders of uncertainty further
> > from size estimation.
> >
> > StringViewArray (a recent addition [1]) allows a more flexible
> > chunking of the data buffers [2].
> >
> > --
> > Felipe
> >
> > [1]
> >
> https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
> > [2]
> >
> https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
> >
> >
> >
> > On Fri, Jul 5, 2024 at 1:35 PM Eric Jacobs <[email protected]> wrote:
> >
> >     Felipe Oliveira Carvalho wrote:
> >     > Hi,
> >     > The builders can't really know the size of the buffers when nested
> >     > types are involved. The general solution would be an expensive
> >     > traversal of the entire tree of builders (e.g. struct builder of
> >     > nested column types like strings) on every append.
> >
> >     I understand that the number and structure of the buffers used
> >     will be
> >     different depending on the datatype of the arrays, and I'm okay with
> >     doing a traversal of the builder tree to identify all of the
> >     buffers in
> >     use. However, I'm not seeing how it would be necessary on every
> >     append
> >     since the topology wouldn't be changing during the build of a single
> >     chunk (correct me if I'm wrong.) A re-traversal of the builder
> >     tree on a
> >     wider granularity basis (e.g. in between chunks) would be acceptable.
> >
> >     > :
> >     > Also make sure you allow length to be > 0 because if a single
> >     string
> >     > is bigger than X MB, you will *have to* violate this max buffer
> >     > constraint. It can only be a soft constraint in a robust solution.
> >     >
> >
> >     If there's no way that the constraint can be maintained as per the
> >     Arrow
> >     in-memory format, it will throw an error out from my MemoryPool,
> >     and in
> >     that case it just won't be supported here.
> >
> >     Thanks,
> >     -Eric
> >
> >     > __
> >     > Felipe
> >     >
> >     > On Thu, Jul 4, 2024 at 3:12 PM Eric Jacobs <[email protected]> wrote:
> >     >
> >     >     Hi,
> >     >     I would like to build a ChunkedArray but I need to limit the
> >     maximum
> >     >     size of each buffer (somewhere in the low MB's). Ending the
> >     current
> >     >     chunk and starting a new one is straightforward, but I'm
> >     having some
> >     >     difficulty detecting when the current buffer(s) are close to
> >     getting
> >     >     full. If I had the Builders I could check the length() as
> >     they are
> >     >     going
> >     >     along, but I'm not sure how I can get access to those as
> >     >     ChunkedArray is
> >     >     being built via the API.
> >     >
> >     >     The size control doesn't have to be precise in my case; it just
> >     >     needs to
> >     >     be conservative as a limit (i.e. the builder cannot go over
> >     X MB)
> >     >
> >     >       Any advice would be appreciated.
> >     >     Thanks,
> >     >     -Eric
> >     >
> >     >
> >
>
>
