If you bring the data into the Java heap, you will use a lot of memory even for a single file: 8 bytes * 200,000 rows * 2,000 columns is 3.2 GB, and that is before the overhead of boxing the values into Double objects (which could double the required memory). The best approach would be to leave the data off-heap and read the values through DataHolders, which should let you get by with one object per vector rather than one object per value.
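To make the arithmetic and the off-heap idea concrete, here is a minimal sketch using only the JDK. It is not the Arrow API: a direct `ByteBuffer` stands in for an off-heap vector, and the class and method names (`OffHeapSketch`, `rawBytes`, `offHeapColumn`, `sum`) are made up for illustration. It only shows that values can be read as primitives through a single buffer object, with no `Double` allocated per value.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class OffHeapSketch {
    // Raw size in bytes of a rows x cols table of 8-byte doubles.
    static long rawBytes(long rows, long cols) {
        return Double.BYTES * rows * cols;
    }

    // Allocate one column outside the Java heap and fill it with sample data.
    // Assumption: this direct buffer is only a stand-in for an off-heap vector.
    static DoubleBuffer offHeapColumn(int rows) {
        DoubleBuffer col = ByteBuffer.allocateDirect(rows * Double.BYTES)
                .order(ByteOrder.nativeOrder())
                .asDoubleBuffer();
        for (int i = 0; i < rows; i++) {
            col.put(i, i * 0.5); // absolute put, primitive double, no boxing
        }
        return col;
    }

    // Sum the column with primitive reads; the only object involved is the buffer.
    static double sum(DoubleBuffer col) {
        double total = 0.0;
        for (int i = 0; i < col.limit(); i++) {
            total += col.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(rawBytes(200_000, 2_000)); // prints 3200000000
        DoubleBuffer col = offHeapColumn(200_000);    // one column, about 1.6 MB off-heap
        System.out.println(col.isDirect());
        System.out.println(sum(col));
    }
}
```

With a real Arrow file the buffer would come from the reader rather than from `allocateDirect`, but the access pattern (one wrapper object, primitive reads) is the same idea.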

On Mon, Jan 30, 2023 at 10:08 AM Chris Nuernberger <[email protected]> wrote:

> TMD <https://github.com/techascent/tech.ml.dataset> supports memory
> mapped arrow files. We don't currently support float8 but I would be
> interested in implementing that if you are interested in trying it out.
> Its Clojure, not java, but is still on the JVM.
>
> This is likely to be your fastest option both in terms of raw performance
> and time to final solution.
>
