Felipe Oliveira Carvalho wrote:
Hi,
The builders can't really know the size of the buffers when nested
types are involved. The general solution would be an expensive
traversal of the entire tree of builders (e.g. struct builder of
nested column types like strings) on every append.
I understand that the number and structure of the buffers used will be
different depending on the datatype of the arrays, and I'm okay with
doing a traversal of the builder tree to identify all of the buffers in
use. However, I don't see why it would be necessary on every append,
since the topology wouldn't change during the build of a single chunk
(correct me if I'm wrong). Re-traversing the builder tree at a coarser
granularity (e.g. between chunks) would be acceptable.
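
To make the "traverse once per chunk, check cheaply per append" idea
concrete, here is a minimal sketch in plain Python. It does not use the
Arrow API: BuilderNode, collect_buffer_owners and largest_buffer are
hypothetical names, and the per-value buffer accounting (4 bytes per
offset, raw bytes for data) is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BuilderNode:
    """Hypothetical stand-in for one builder in a nested builder tree."""
    name: str
    buffer_sizes: List[int] = field(default_factory=list)  # bytes per buffer
    children: List["BuilderNode"] = field(default_factory=list)


def collect_buffer_owners(root):
    """Walk the tree once and return every node in it (the topology)."""
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    return nodes


def largest_buffer(nodes):
    """Scan a cached, flat node list; no tree traversal is needed here."""
    return max((size for n in nodes for size in n.buffer_sizes), default=0)


# struct<name: string> modelled as a struct builder with one string child.
# Assumed buffer layout for the string child: [validity, offsets, data].
strings = BuilderNode("name", buffer_sizes=[0, 4, 0])  # offsets start with one entry
struct = BuilderNode("struct", buffer_sizes=[0], children=[strings])

cached = collect_buffer_owners(struct)  # done once, e.g. at the start of a chunk

for value in ["alpha", "beta", "gamma"]:
    strings.buffer_sizes[1] += 4           # one 32-bit offset per value
    strings.buffer_sizes[2] += len(value)  # raw string bytes
    current_max = largest_buffer(cached)   # per-append check on the cached list
```

The cached list only needs rebuilding when the set of builders changes,
which in this model happens at most between chunks.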
Also make sure you allow length to be > 0 because if a single string
is bigger than X MB, you will *have to* violate this max buffer
constraint. It can only be a soft constraint in a robust solution.
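
A minimal sketch of the soft-constraint behaviour, in plain Python
rather than against the Arrow API. The function name
chunk_with_soft_limit and the choice to count only the value bytes are
assumptions made for illustration.

```python
def chunk_with_soft_limit(values, max_bytes):
    """Group byte strings into chunks whose data stays within max_bytes.

    The limit is soft: a single value larger than max_bytes still gets
    a chunk of its own, because it cannot be split.
    """
    chunks, current, current_bytes = [], [], 0
    for value in values:
        size = len(value)
        # Close the current chunk first if this value would overflow it.
        # An empty chunk always accepts the value, whatever its size.
        if current and current_bytes + size > max_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(value)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_with_soft_limit(
    [b"aaaa", b"bbbb", b"c" * 20, b"dd"], max_bytes=10
)
# The 20-byte value exceeds the limit and ends up alone in its own chunk.
```

Checking before the append, rather than after, is what keeps every
chunk within the limit except for the unavoidable oversized value.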
If there's no way to maintain the constraint within the Arrow in-memory
format, my MemoryPool will throw an error, and in that case it just
won't be supported here.
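
For illustration only, the hard-limit behaviour described above could
look like the following plain-Python model. CappedPool and
BufferTooLarge are hypothetical names; this is not the Arrow MemoryPool
interface, just a sketch of an allocator that rejects any single buffer
above a cap.

```python
class BufferTooLarge(MemoryError):
    """Raised when a single allocation exceeds the per-buffer cap."""


class CappedPool:
    """Hypothetical allocator that rejects any buffer above max_buffer_bytes."""

    def __init__(self, max_buffer_bytes):
        self.max_buffer_bytes = max_buffer_bytes
        self.bytes_allocated = 0

    def allocate(self, size):
        # The cap applies per buffer, not to the running total.
        if size > self.max_buffer_bytes:
            raise BufferTooLarge(
                f"requested {size} bytes, cap is {self.max_buffer_bytes}"
            )
        self.bytes_allocated += size
        return bytearray(size)


pool = CappedPool(max_buffer_bytes=1024)
small = pool.allocate(512)       # within the cap, succeeds

try:
    pool.allocate(4096)          # a single oversized buffer
    rejected = False
except BufferTooLarge:
    rejected = True              # the caller sees an error, nothing is allocated
```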
Thanks,
-Eric
__
Felipe
On Thu, Jul 4, 2024 at 3:12 PM Eric Jacobs <[email protected]> wrote:
Hi,
I would like to build a ChunkedArray but I need to limit the maximum
size of each buffer (somewhere in the low MBs). Ending the current
chunk and starting a new one is straightforward, but I'm having some
difficulty detecting when the current buffer(s) are close to getting
full. If I had the Builders I could check their length() as they go
along, but I'm not sure how I can get access to them while the
ChunkedArray is being built via the API.
The size control doesn't have to be precise in my case; it just needs
to be conservative as a limit (i.e. the builder cannot go over X MB).
Any advice would be appreciated.
Thanks,
-Eric