Note, I'm mostly concerned about constraining the memory use when reading
record batches from the IPC format. I'm not so concerned about memory use
by the builders while writing them.

Is the power-of-two allocation also used when reading a record batch from
an IPC file? I would have assumed that wouldn't be necessary, since the
required sizes are known up front and encoded in the IPC format.
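For context, the doubling growth policy described in the quoted reply below can be sketched in plain Go. This is a minimal illustration of the general technique, not the actual Arrow allocator logic; `nextCapacity` is a hypothetical helper.

```go
package main

import "fmt"

// nextCapacity mimics a power-of-two growth policy: when the current
// capacity is exceeded, the allocation doubles until it fits. This is
// a simplified sketch, not the actual Arrow allocator behaviour.
func nextCapacity(capacity, needed int) int {
	if capacity == 0 {
		capacity = 1
	}
	for capacity < needed {
		capacity *= 2
	}
	return capacity
}

func main() {
	// Starting with space for 80 integers, appending an 81st value
	// doubles the allocation to 160.
	fmt.Println(nextCapacity(80, 81)) // 160
}
```

This is why write-side memory can overshoot a per-value estimate: the backing buffer can be up to twice the size the appended data strictly requires.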

On Fri, 15 Mar 2024 at 11:33, Jacques Nadeau <jacq...@apache.org> wrote:

> It depends on the implementation, but some implementations use power-of-two
> allocations or similar (I'm not sure about the Go one). So one might start with
> space for 80 integers and then once you get to 81, allocation doubles to
> 160 integers. I know the Java library historically operated this way
> (albeit not exactly a power of two because of space related to colocated
> allocations for nullability). So trying to constrain memory with
> record-at-a-time writing/reallocation will likely turn out pretty poorly. I
> recommend you initially preallocate your batch based on an estimate of the
> maximum memory, then fill things in, and adjust your estimation algorithm
> over time.
>
> On Thu, Mar 14, 2024, 12:25 PM Greg Lowe <greg.l...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm replying to the following thread; I'm not sure whether this message
>> will appear in the right place.
>> https://lists.apache.org/thread/93kg641xk52lm5m11vwodbyc1hzvbnf3
>>
>> I've implemented a workaround for a similar use case. I thought I'd share
>> it, in case someone can recommend a better solution using the existing
>> API, or to discuss additions to the API that could make this easier.
>>
>> In my use case the limitation is the memory available when reading a
>> record batch. I'd like to keep the in-memory size of each record batch
>> within a maximum number of bytes. Note, I'm not concerned about the disk
>> size (which will be smaller due to LZ4 compression).
>>
>> So when appending values, I'd like to be able to specify a maximum size,
>> say 500 MB, and once that's exceeded, write the record batch to disk.
>>
>> The data types I need to support are float64, int64,
>> bool, listof(float64), listof(int64), listof(bool), and strings.
>>
>> In my use case, I'm writing to a builder in a row-wise fashion. My
>> current approach is that, as I write each cell, I increment a variable
>> that tracks the approximate in-memory size in bytes. Luckily, for the
>> types I need to support, this is fairly simple to approximate.
>>
>> e.g. a float64 adds 8 bytes, and a list of float64 adds len(floats)*8 + 8.
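The accounting described above can be sketched in plain Go. `sizeTracker` and its methods are hypothetical names, not part of the Arrow API, and the constants are rough estimates (validity bitmaps and allocator overhead are ignored).

```go
package main

import "fmt"

// sizeTracker approximates the in-memory size of a record batch as
// values are appended. Hypothetical helper, not part of the Arrow API;
// the constants are rough per-cell estimates.
type sizeTracker struct {
	bytes int64
}

func (t *sizeTracker) addFloat64() { t.bytes += 8 }
func (t *sizeTracker) addInt64()   { t.bytes += 8 }

// Arrow packs booleans as bits; one byte per value is a safe over-estimate.
func (t *sizeTracker) addBool() { t.bytes += 1 }

// 8 bytes per element plus offset overhead, matching the estimate above.
func (t *sizeTracker) addFloat64List(n int) { t.bytes += int64(n)*8 + 8 }

// String data plus a 32-bit offset (Arrow's String type uses 4-byte offsets).
func (t *sizeTracker) addString(s string) { t.bytes += int64(len(s)) + 4 }

func main() {
	var t sizeTracker
	t.addFloat64()       // +8
	t.addFloat64List(3)  // +3*8+8 = 32
	fmt.Println(t.bytes) // 40
}
```

A threshold check after each row (e.g. `t.bytes >= 500<<20`) could then trigger flushing the builders to the IPC writer and resetting the tracker.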
>>
>> Is there a better way to do this using the existing API?
>>
>> Would it make sense for this to be supported natively by the API?
>>
>> I'm using the Go implementation, but I guess this applies equally to C++
>> and maybe other implementations too.
>>
>> Thanks for taking the time to read this.
>>
>> Cheers,
>> Greg
>>
>
