Re: [Arrow IPC] memory mapping of compressed file / lazy reading

2023-06-03 Thread Frédéric MASSON
I agree with you: the amount of memory used depends on user behavior, but that is exactly the point, to keep in memory only what the user is actually using and nothing more. I also agree that, even with memory mapping, the disk load can still be present. For example, I noticed that with the code below, the read_all call with …
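
(A minimal sketch of the behavior under discussion; the file name "data.arrow" and its contents are assumptions, not taken from the thread. Even when a file is memory mapped, read_all still touches every buffer, so the whole file is paged in from disk.)

    import pyarrow as pa
    import pyarrow.ipc as ipc

    # Even through a memory map, read_all() touches every buffer in the
    # file, so the kernel must page the entire file in from disk.
    with pa.memory_map("data.arrow", "r") as source:
        table = ipc.open_file(source).read_all()

    # Reading a single batch instead only faults in that batch's pages.
    with pa.memory_map("data.arrow", "r") as source:
        batch = ipc.open_file(source).get_batch(0)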

Re: [Arrow IPC] memory mapping of compressed file / lazy reading

2023-05-26 Thread Weston Pace
Thanks for the clarification, I understand your use case better now. You are right that memory mapping can be used in the way you describe. > why does it decompress the data here? For me it is doing an unnecessary copy > by transforming a compressed record batch into an uncompressed record …
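
(A sketch of the decompression point being quoted; the file name and table contents are illustrative assumptions. When an IPC file is written with buffer compression, reading a batch has to decompress its buffers into newly allocated memory, even if the file itself is memory mapped.)

    import pyarrow as pa
    import pyarrow.ipc as ipc

    # Write an IPC file with zstd buffer compression.
    table = pa.table({"x": list(range(1000))})
    options = ipc.IpcWriteOptions(compression="zstd")
    with pa.OSFile("compressed.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema, options=options) as writer:
            writer.write_table(table)

    # The mapped pages hold compressed bytes; accessing a batch forces a
    # decompression copy into fresh, process-owned memory.
    with pa.memory_map("compressed.arrow", "r") as source:
        batch = ipc.open_file(source).get_batch(0)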

Re: [Arrow IPC] memory mapping of compressed file / lazy reading

2023-05-25 Thread Frédéric MASSON
Hi, thank you very much for your answer, and I am sorry if some of my sentences were confusing. I did not know about the kernel space/user space distinction, or that memory-mapped I/O is more general than just file memory mapping; I have a better understanding now. So I looked a bit deeper inside memory mapping …

Re: [Arrow IPC] memory mapping of compressed file / lazy reading

2023-05-22 Thread Weston Pace
Well, I suppose there are cases where you can map a file with memory-mapped I/O and then, if you are careful not to touch those buffers, they might not be loaded into memory. However, that is a very difficult thing to achieve. For example, when reading a file we need to access the metadata that …
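
(A sketch of why this is hard to achieve; the file name is an assumption. Even just opening an IPC file touches the pages holding the footer and schema, so some loading always happens; only the untouched record batch buffers can stay out of memory.)

    import pyarrow as pa
    import pyarrow.ipc as ipc

    # Opening the file reads the footer and schema, so those pages are
    # faulted in; record batch buffers stay untouched until accessed.
    with pa.memory_map("data.arrow", "r") as source:
        reader = ipc.open_file(source)
        schema = reader.schema  # metadata pages only
        # reader.get_batch(0) would pull that batch's data pages in too.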

Re: [Arrow IPC] memory mapping of compressed file / lazy reading

2023-05-22 Thread Weston Pace
I'm a little bit confused by the benchmark. The benchmark is labeled "open file", and yet read_table will read the entire file into memory. I don't think your other benchmarks are doing this (i.e. they are not reading data into memory). As for the questions on memory mapping, I have a few …
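
(A sketch of the distinction being drawn; the path and the timing harness are assumptions. Opening an IPC file only parses the footer, while a full read such as read_table or read_all materializes every record batch, so the two should not be benchmarked under the same label.)

    import time
    import pyarrow as pa
    import pyarrow.ipc as ipc

    path = "data.arrow"  # hypothetical benchmark file

    # "Open" parses only the footer/schema; cost is near-constant.
    t0 = time.perf_counter()
    with pa.OSFile(path, "rb") as f:
        reader = ipc.open_file(f)
    t1 = time.perf_counter()

    # A full read materializes every record batch into memory.
    with pa.OSFile(path, "rb") as f:
        table = ipc.open_file(f).read_all()
    t2 = time.perf_counter()

    print(f"open: {t1 - t0:.4f} s, full read: {t2 - t1:.4f} s")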

[Arrow IPC] memory mapping of compressed file / lazy reading

2023-05-21 Thread Frédéric MASSON
Hello everyone, for several years I have been working with HDF5 files to store/load information, with pandas as the in-memory representation for analyzing it. Overall, the data can vary in size (from a few MB to 10 GB). I use the dataframes inside interactive tools (with a GUI, where the data …