Good to know. IPC in general should be better. The worst-case scenario I've seen is the row-wise population situation.
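If you want to see that growth directly, something like the sketch below should print the allocation steps in Go. It's untested and the import path depends on which Arrow release you're on, but CheckedAllocator is handy for exactly this kind of measurement:

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/v15/arrow/array"
        "github.com/apache/arrow/go/v15/arrow/memory"
    )

    func main() {
        // Wrap the Go allocator so we can watch how many bytes the
        // builder holds as values are appended one at a time.
        mem := memory.NewCheckedAllocator(memory.NewGoAllocator())

        bld := array.NewFloat64Builder(mem)
        defer bld.Release()

        // Appending row-wise: whenever capacity runs out, the builder
        // reallocates (roughly doubling), so the held bytes step up in
        // big jumps instead of growing 8 bytes per value.
        prev := 0
        for i := 0; i < 1_000_000; i++ {
            bld.Append(float64(i))
            if cur := mem.CurrentAlloc(); cur != prev {
                fmt.Printf("after %8d appends: %10d bytes held\n", i+1, cur)
                prev = cur
            }
        }

        // Calling bld.Reserve(n) up front with an estimated n avoids
        // the doubling entirely -- that's the "preallocate from an
        // estimate" approach.
    }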
On Thu, Mar 14, 2024, 2:52 PM Greg Lowe <greg.l...@gmail.com> wrote:

> I had a quick look at the Arrow Go source code. In the IPC case, when
> using the Go allocator, it looks like it allocates to the nearest
> multiple of 64 bytes. I'm not very familiar with the details of how the
> Go runtime handles large byte array allocations, but from a quick scan
> of the docs, I believe these get rounded up to the nearest page size of
> 4K. So I don't think there's a power-of-two issue when reading record
> batches via IPC.
>
> On Fri, 15 Mar 2024 at 13:34, Jacques Nadeau <jacq...@apache.org> wrote:
>
>> I would expect Go to allocate to the IPC size, but the underlying
>> allocator behavior will still be present. It seems like the golang
>> runtime allocator is based on tcmalloc, so it would probably round up
>> to the next size class. I'd assume the waste increases at larger
>> allocation sizes, but you'd have to review the details to understand
>> it better.
>>
>> On Thu, Mar 14, 2024, 2:15 PM Greg Lowe <greg.l...@gmail.com> wrote:
>>
>>> Note, I'm mostly concerned about constraining memory use when
>>> reading record batches from the IPC format. I'm not so concerned
>>> about memory use by the builders while writing them.
>>>
>>> Is the power-of-two allocation also used when reading a record batch
>>> from an IPC file? I would have assumed that wouldn't be necessary,
>>> since the required sizes would be known up front and encoded in the
>>> IPC format.
>>>
>>> On Fri, 15 Mar 2024 at 11:33, Jacques Nadeau <jacq...@apache.org> wrote:
>>>
>>>> It depends on the implementation, but some implementations use
>>>> power-of-two allocations or similar (not sure on the golang front).
>>>> So one might start with space for 80 integers, and then once you get
>>>> to 81, the allocation doubles to 160 integers. I know the Java
>>>> library historically operated this way (albeit not exactly a power
>>>> of two, because of space related to colocated allocations for
>>>> nullability). So trying to constrain memory with record-at-a-time
>>>> writing/reallocation will likely turn out pretty poorly. I recommend
>>>> you initially preallocate your batch size to your max memory based
>>>> on estimates, fill things in, and then adjust your estimation
>>>> algorithm over time.
>>>>
>>>> On Thu, Mar 14, 2024, 12:25 PM Greg Lowe <greg.l...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm aiming to reply to the following thread. Not sure if this
>>>>> message will appear in the right place.
>>>>> https://lists.apache.org/thread/93kg641xk52lm5m11vwodbyc1hzvbnf3
>>>>>
>>>>> I've implemented a workaround for a similar use case. I thought I'd
>>>>> share it, as someone may be able to recommend a better solution
>>>>> using the existing API, or perhaps we can discuss additions to the
>>>>> API that would make this easier.
>>>>>
>>>>> In my use case, the limitation is the memory available when reading
>>>>> a record batch. I'd like to keep the in-memory size of each record
>>>>> batch within a maximum number of bytes. Note, I'm not concerned
>>>>> about the on-disk size (which will be smaller due to LZ4
>>>>> compression).
>>>>>
>>>>> So when appending values, I'd like to be able to specify a maximum
>>>>> size, say 500MB, and then once that's exceeded, write the record
>>>>> batch to disk.
>>>>>
>>>>> The data types I need to support are float64, int64, bool,
>>>>> listof(float64), listof(int64), listof(bool), and strings.
>>>>>
>>>>> In my use case, I'm writing to a builder in a row-wise fashion.
>>>>> My current approach is: as I write each cell, I increment a
>>>>> variable that tracks the approximate memory used, in bytes.
>>>>> Luckily, for the types I need to support, this is fairly simple to
>>>>> track approximately.
>>>>>
>>>>> i.e. a float64 is "+8", a list of float64 is "len(floats)*8+8".
>>>>>
>>>>> Is there a better way to do this using the existing API?
>>>>>
>>>>> Would it make sense for this to be supported natively by the API?
>>>>>
>>>>> I'm using the Go implementation, but I guess this applies equally
>>>>> to C++, and maybe other implementations too.
>>>>>
>>>>> Thanks for taking the time to read this.
>>>>>
>>>>> Cheers,
>>>>> Greg
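P.S. For anyone finding this thread later, here's roughly what the per-cell accounting Greg describes could look like in Go. The type and method names are made up for illustration, the byte costs mirror his "+8" / "len(floats)*8+8" figures, and it deliberately ignores validity bitmaps and allocator rounding, so treat it as an estimate rather than an exact accounting:

    package main

    import "fmt"

    // batchSizeEstimator tracks the approximate in-memory size of a
    // record batch as cells are appended row-wise. Fixed-width values
    // cost their width; lists cost their element bytes plus an offset.
    type batchSizeEstimator struct {
        bytes int64
        limit int64
    }

    func (e *batchSizeEstimator) addFloat64() { e.bytes += 8 }
    func (e *batchSizeEstimator) addInt64()   { e.bytes += 8 }

    // Arrow stores bools as 1 bit; counting a full byte over-estimates,
    // which errs on the safe side for a memory cap.
    func (e *batchSizeEstimator) addBool() { e.bytes++ }

    // String data plus a 32-bit offset entry (use 8 for LargeString).
    func (e *batchSizeEstimator) addString(s string) { e.bytes += int64(len(s)) + 4 }

    func (e *batchSizeEstimator) addFloat64List(v []float64) { e.bytes += int64(len(v))*8 + 8 }
    func (e *batchSizeEstimator) addInt64List(v []int64)     { e.bytes += int64(len(v))*8 + 8 }
    func (e *batchSizeEstimator) addBoolList(v []bool)       { e.bytes += int64(len(v)) + 8 }

    // full reports whether the current batch should be flushed to disk
    // and a fresh builder started.
    func (e *batchSizeEstimator) full() bool { return e.bytes >= e.limit }

    func (e *batchSizeEstimator) reset() { e.bytes = 0 }

    func main() {
        est := &batchSizeEstimator{limit: 500 << 20} // 500MB cap per batch

        row := make([]float64, 100) // stand-in for one list cell per row
        for i := 0; i < 2_000_000; i++ {
            est.addFloat64()
            est.addFloat64List(row)
            if est.full() {
                fmt.Printf("flush batch (~%d bytes) after %d rows\n", est.bytes, i+1)
                est.reset()
            }
        }
    }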