Each file has 4 batches. Each batch contains 200,000 / 4 = 50,000 rows by 2,000 columns = 100,000,000 8-byte floats, which is about 0.75 GB if the whole batch is loaded into memory. So my expectation is for memory usage to stay close to the size of one batch. For this test, I am not allocating any extra space while reading: my code uses only one float array of 2,000 elements, and no Double objects are created. Even so, the Dataset API throws an OutOfMemory exception if there is less than 7 GB of RAM, and reading with ArrowFileReader throws an OutOfMemory exception if there is less than 3 GB of RAM. Is there a way to read the file without having it all loaded into RAM?
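For reference, here is a sketch of the kind of read loop I mean (the file name is a placeholder, and I am assuming the values are non-null so that Float8Vector.get(int) can return a primitive double):

    import java.io.FileInputStream;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.Float8Vector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileReader;

    public class ReadLoop {
        public static void main(String[] args) throws Exception {
            try (RootAllocator allocator = new RootAllocator();
                 FileInputStream in = new FileInputStream("data.arrow"); // placeholder path
                 ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
                VectorSchemaRoot root = reader.getVectorSchemaRoot();
                int numCols = root.getSchema().getFields().size(); // 2000 in this test
                double[] row = new double[numCols];                // the only scratch array
                while (reader.loadNextBatch()) {                   // one record batch at a time
                    for (int r = 0; r < root.getRowCount(); r++) {
                        for (int c = 0; c < numCols; c++) {
                            // get(int) returns a primitive double, so no Double boxing
                            row[c] = ((Float8Vector) root.getVector(c)).get(r);
                        }
                        // process row here (placeholder)
                    }
                }
            }
        }
    }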
On Mon, Jan 30, 2023 at 10:19 AM Larry White <[email protected]> wrote:

> If you bring the data into the Java memory space, you will use a lot of
> memory even just for one file: 8 bytes * 200,000 rows * 2,000 columns is
> 3.2 GB, even without the overhead of converting the values to Double
> objects (which could double the required memory). The best approach would
> be to leave the data off-heap and access the values using DataHolders,
> which should let you access the values using one object per vector.
>
> On Mon, Jan 30, 2023 at 10:08 AM Chris Nuernberger <[email protected]>
> wrote:
>
>> TMD <https://github.com/techascent/tech.ml.dataset> supports
>> memory-mapped Arrow files. We don't currently support float8, but I
>> would be interested in implementing that if you are interested in trying
>> it out. It's Clojure, not Java, but it is still on the JVM.
>>
>> This is likely to be your fastest option both in terms of raw
>> performance and time to final solution.
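For the holder approach Larry mentions, a minimal sketch — assuming the Arrow Java value-holder class NullableFloat8Holder, with the root, row, and column coming from a read loop like the one above:

    import org.apache.arrow.vector.Float8Vector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.holders.NullableFloat8Holder;

    public final class HolderRead {
        // Reads one cell through a reusable holder: the value is copied into
        // the holder's primitive fields, so no Double object is created per cell.
        static double readCell(VectorSchemaRoot root, int col, int row,
                               NullableFloat8Holder holder) {
            Float8Vector vector = (Float8Vector) root.getVector(col);
            vector.get(row, holder);  // fills the holder from the off-heap buffer
            return holder.isSet == 1 ? holder.value : Double.NaN; // NaN for null (a choice here)
        }
    }

Allocating one holder per vector up front and reusing it keeps the per-cell cost at zero allocations.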
