Yup, I'm with you. In the code snippet I shared, you'll notice I passed
in a `len` argument, which is counted against the available space and
serves as a conservative estimate of how much buffer space the element
could take up. In other words, it asserts that no BufferBuilder would
receive more than `len` bytes as a result of the forthcoming operation.
It seems possible to determine such an upper limit; access to the
BufferBuilders themselves is what I don't have.
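To make the idea concrete, here is a minimal pure-Python sketch of how
such a conservative per-append reservation could drive chunk rollover.
The class and method names are illustrative stand-ins, not the Arrow
builder API:

```python
class ChunkingAppender:
    """Toy sketch: end the current chunk whenever a conservative
    upper-bound reservation would push it past the byte limit."""

    def __init__(self, max_chunk_bytes):
        self.max_chunk_bytes = max_chunk_bytes
        self.chunks = []    # finished chunks (lists of values)
        self.current = []   # values in the chunk being built
        self.reserved = 0   # conservative byte estimate so far

    def append(self, value, upper_bound_len):
        # `upper_bound_len` plays the role of the `len` argument:
        # no buffer may grow by more than this many bytes.
        if self.reserved + upper_bound_len > self.max_chunk_bytes \
                and self.current:
            self.chunks.append(self.current)  # end the chunk early
            self.current, self.reserved = [], 0
        self.current.append(value)
        self.reserved += upper_bound_len

    def finish(self):
        if self.current:
            self.chunks.append(self.current)
        return self.chunks
```

With a 10-byte cap, appending "aaaa", "bbbb", "cccc", "dd" (each
reserving its own length) rolls over after the second string.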
> StringViewArray (a recent addition [1]) allows a more flexible
> chunking of the data buffers [2].
Thanks! I'll check it out.
-Eric
Felipe Oliveira Carvalho wrote:
> However, I'm not seeing how it would be necessary on every append
> since the topology wouldn't be changing during the build of a single
> chunk (correct me if I'm wrong.)
A StringArray, for example, stores all the strings in a single data
buffer, one after the other, so after every append the buffer can grow
by almost any amount. If you say you're going to append `len` strings,
they could all be empty (the buffer grows by 0 bytes) or each be
something like 1 MB (the buffer grows by len * 1 MB). ListArray has a
similar problem: it stores all the elements of its lists in the same
child array, and if that child array is a string array, you're now two
levels of uncertainty further from a size estimate.
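This can be illustrated with a toy model of the StringArray layout (a
32-bit offsets buffer plus one shared data buffer); the helper below is
illustrative, not the Arrow API:

```python
def string_array_buffer_sizes(strings):
    """Toy model of the StringArray layout: an int32 offsets buffer
    with len+1 entries, and a single concatenated data buffer."""
    offsets_bytes = 4 * (len(strings) + 1)
    data_bytes = sum(len(s.encode()) for s in strings)
    return offsets_bytes, data_bytes
```

Appending three empty strings grows the data buffer by 0 bytes;
appending three 1 MB strings grows it by 3 MB. Only the offsets buffer
is predictable from the element count alone.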
StringViewArray (a recent addition [1]) allows a more flexible
chunking of the data buffers [2].
--
Felipe
[1]
https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
[2]
https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
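The flexibility comes from the fixed 16-byte views: per the layout in
[2], strings of 12 bytes or fewer are stored inline, and longer strings
carry a 4-byte prefix plus a reference into any of several data
buffers. A toy encoding of a single view, for illustration only:

```python
import struct

def make_view(s, buffer_index=0, offset=0):
    """Toy encoding of one 16-byte string view per the Arrow
    binary-view layout: short strings (<= 12 bytes) are inlined;
    long ones store a 4-byte prefix plus a buffer reference."""
    data = s.encode()
    if len(data) <= 12:
        return struct.pack("<i", len(data)) + data.ljust(12, b"\0")
    return struct.pack("<i4sii", len(data), data[:4], buffer_index, offset)
```

Because a view can point into any buffer, a builder can close a full
data buffer and start a new one without rewriting existing views.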
On Fri, Jul 5, 2024 at 1:35 PM Eric Jacobs <[email protected]> wrote:
Felipe Oliveira Carvalho wrote:
> Hi,
> The builders can't really know the size of the buffers when nested
> types are involved. The general solution would be an expensive
> traversal of the entire tree of builders (e.g. struct builder of
> nested column types like strings) on every append.
I understand that the number and structure of the buffers used will be
different depending on the datatype of the arrays, and I'm okay with
doing a traversal of the builder tree to identify all of the buffers in
use. However, I'm not seeing how it would be necessary on every append,
since the topology wouldn't be changing during the build of a single
chunk (correct me if I'm wrong.) A re-traversal of the builder tree at
a wider granularity (e.g. in between chunks) would be acceptable.
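A recursive traversal of that kind could look like the sketch below;
the `buffers` and `children` attributes are hypothetical stand-ins for
whatever the real builder tree exposes, not the Arrow C++ API:

```python
from collections import namedtuple

# Illustrative builder node: some buffers plus child builders.
Builder = namedtuple("Builder", ["buffers", "children"])

def collect_buffer_bytes(builder):
    """Sum the bytes held by a builder and all of its children
    (e.g. a struct builder over nested string columns)."""
    total = sum(len(b) for b in builder.buffers)
    for child in builder.children:
        total += collect_buffer_bytes(child)
    return total
```

Run once per chunk boundary, this gives a snapshot of total buffer
usage without paying the traversal cost on every append.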
> Also make sure you allow length to be > 0 because if a single string
> is bigger than X MB, you will *have to* violate this max buffer
> constraint. It can only be a soft constraint in a robust solution.
>
If there's no way the constraint can be maintained under the Arrow
in-memory format, an error will be thrown out of my MemoryPool, and in
that case it just won't be supported here.
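That approach could be sketched as a capping pool wrapper; the names
here are illustrative, not Arrow's MemoryPool interface:

```python
class CappedPool:
    """Toy pool that refuses any single allocation larger than the
    cap, mirroring the throw-from-MemoryPool approach described
    above (illustrative names, not the Arrow C++ API)."""

    def __init__(self, max_alloc_bytes):
        self.max_alloc_bytes = max_alloc_bytes
        self.allocated = 0

    def allocate(self, nbytes):
        if nbytes > self.max_alloc_bytes:
            raise MemoryError(
                f"allocation of {nbytes} bytes exceeds cap")
        self.allocated += nbytes
        return bytearray(nbytes)
```

An oversized element (e.g. a single string bigger than the cap) then
fails loudly at allocation time rather than silently violating the
limit.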
Thanks,
-Eric
> --
> Felipe
>
> On Thu, Jul 4, 2024 at 3:12 PM Eric Jacobs
> <[email protected]> wrote:
>
> Hi,
> I would like to build a ChunkedArray, but I need to limit the maximum
> size of each buffer (somewhere in the low MBs). Ending the current
> chunk and starting a new one is straightforward, but I'm having some
> difficulty detecting when the current buffer(s) are close to getting
> full. If I had the Builders I could check the length() as they are
> going along, but I'm not sure how I can get access to those as the
> ChunkedArray is being built via the API.
>
> The size control doesn't have to be precise in my case; it just needs
> to be conservative as a limit (i.e. the builder cannot go over X MB).
>
> Any advice would be appreciated.
> Thanks,
> -Eric