I recommend that you direct these questions to u...@arrow.apache.org (https://mail-archives.apache.org/mod_mbox/arrow-user/).
On Fri, Jan 29, 2021 at 7:07 AM Joris Peeters <joris.mg.peet...@gmail.com> wrote:
>
> Hello,
>
> I'm writing an HTTP server in Java that provides Arrow data to users. For
> performance, I keep the most-recently-used Arrow batches in an in-memory
> cache. A batch is wrapped in a "DataBatch" Java object containing the
> schema and field vectors.
>
> I'm looking for a good memory management strategy here, given that:
> - batches can be evicted from the in-memory cache, and the underlying
>   memory should be cleared up as quickly as possible, *if nothing else is
>   using them*,
> - data retrieved from the cache undergoes a zero-copy path with filters etc.
>   (which are views on the underlying data) before being sent out, so it can
>   still be in use when it gets cache-evicted, as there are multiple
>   simultaneous threads.
>
> I'm used to C++, where this scenario would seem relatively unchallenging,
> as we'd keep std::shared_ptrs and just clean up everything in the
> destructor.
>
> In Java, however, it seems that:
> - Object#finalize is deprecated, and not super-reliable anyway,
> - GC might only happen when there is pressure on the Java heap, but the
>   Arrow data is allocated in Netty buffers.
>
> I wonder if people have encountered this scenario before, and what approach
> was favoured. Some ideas:
> - Manually maintain a ref-count and free when it goes to zero. This seems
>   brittle in the face of errors etc. that could lead to leaks.
> - Use the PhantomReference mechanism. This would appear to suffer from the
>   same potential delay in GC, though, i.e. my Java object is just a little
>   holder for the underlying FieldVectors. Perhaps there's a way of saying
>   that these DataBatch objects should be GC'd often?
> - Make a copy of the data when it gets retrieved from the cache, so an
>   eviction from the cache means it can always be safely removed. This seems
>   very wasteful, and not very scalable if there are other reuse paths.
> - Allocate the buffers in a way that counts towards heap memory pressure.
>
> Any thoughts are appreciated! I'm not a Java expert at all, so I may be
> missing obvious things, or thinking about it non-idiomatically.
>
> Best,
> -J
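
For the manual ref-count idea in the list above, here is a minimal sketch of how the brittleness might be reduced by tying each reference to a try-with-resources block. It uses no Arrow APIs. SharedBatch, Lease, acquire() and evict() are hypothetical names made up for illustration, and the Runnable is only a stand-in for whatever actually closes the field vectors.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder: frees the resource once the cache and all readers are done. */
final class SharedBatch {
    // Starts at 1: the cache itself holds the first reference.
    private final AtomicInteger refs = new AtomicInteger(1);
    private final AtomicBoolean evicted = new AtomicBoolean(false);
    // Assumption: stands in for closing the underlying vectors/buffers.
    private final Runnable freeAction;

    SharedBatch(Runnable freeAction) {
        this.freeAction = freeAction;
    }

    /** Takes a reference for a reader; throws if the batch was already freed. */
    Lease acquire() {
        while (true) {
            int n = refs.get();
            if (n == 0) {
                throw new IllegalStateException("batch already freed");
            }
            // CAS so a count that has reached zero can never be revived.
            if (refs.compareAndSet(n, n + 1)) {
                return new Lease(this);
            }
        }
    }

    /** Called by the cache on eviction; drops the cache's reference once. */
    void evict() {
        if (evicted.compareAndSet(false, true)) {
            release();
        }
    }

    private void release() {
        if (refs.decrementAndGet() == 0) {
            freeAction.run(); // last user gone: free immediately, no GC involved
        }
    }

    /** One reader's reference; closing it twice is harmless. */
    static final class Lease implements AutoCloseable {
        private final SharedBatch batch;
        private final AtomicBoolean closed = new AtomicBoolean(false);

        private Lease(SharedBatch batch) {
            this.batch = batch;
        }

        @Override
        public void close() {
            if (closed.compareAndSet(false, true)) {
                batch.release();
            }
        }
    }
}
```

A request handler would then write something like `try (SharedBatch.Lease lease = batch.acquire()) { ... }`, so the reference is dropped even if filtering or serialization throws. The memory is freed by whichever of the cache or the last reader lets go last. This does not protect against code that keeps using a view after its lease is closed.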