I'm a little bit confused by the benchmark. The benchmark is labeled "open file", and yet "read_table" will read the entire file into memory. I don't think your other benchmarks are doing this (i.e. they are not reading the data into memory).
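For what it's worth, here is a minimal sketch of what I mean by "reading into memory" versus just opening the file. The file name "data.arrow" and the column name "col_0" are only placeholders, and it assumes an uncompressed Arrow IPC (Feather v2) file:

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.ipc as ipc

    # Eager: read_table materializes the whole table, to_pandas then converts it.
    df = feather.read_table("data.arrow").to_pandas()

    # Column projection: only the requested column is read off disk.
    one_col = feather.read_table("data.arrow", columns=["col_0"])

    # Memory mapped: opening the file only reads the metadata; for an
    # uncompressed file the table's buffers reference the mapping, so pages
    # are only faulted in when a column is actually touched.
    source = pa.memory_map("data.arrow", "r")
    table = ipc.open_file(source).read_all()
    col = table.column("col_0")

In either case, to_pandas will still copy everything into pandas memory, which is what your benchmark is measuring.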
As for the questions on memory mapping, I have a few answers below, but I will give a general answer here. Memory-mapped I/O will, at best, save you one memcpy of the data from kernel space to user space. Memory mapping is not the same as a "lazy dataframe". If you ask Arrow to read a file, it will always load that file off of the disk and into memory; this is true whether you use memory-mapped I/O or not. If you ask it to load a single column, then it will not load the entire file but instead load just that column. There are many other libraries that add "lazy dataframe" capabilities on top of Arrow files. What is it that you would like to achieve with Arrow?

> According to the benchmark, the function to_pandas is loading all the data into memory. Do you agree or did I miss something?

Yes. to_pandas will load the entire file into memory.

> When you open an Arrow IPC file with memory mapping and add a column, does it write the column on disk?

If you open any existing file with memory mapping, it is generally assumed it will be read-only. In theory, you could memory map a larger space and then write into it over time, but none of the core Arrow utilities are going to do anything like that.

> When opening a compressed Arrow IPC file, what does memory mapping mean? What is the difference from opening the same file without memory mapping?

It means that you will be able to avoid a memcpy of the compressed bytes from kernel space to user space.

On Sun, May 21, 2023 at 10:32 AM Frédéric MASSON <[email protected]> wrote:

> Hello everyone,
>
> For several years I have been working with HDF5 files to store/load information and pandas as the in-memory representation to analyze them. The data can be of variable size (from a few MB to 10 GB). I use the dataframes inside interactive tools (with a GUI, where data access is quite random) and non-interactive tools (scripts); everything is in Python, but the files could be opened in other languages. The typical use case is to get only some columns of the file, do some operations on them and plot the result. Since the files are quite big, data compression is quite important for me to save disk space. Writing duration, however, is not very important.
> Of course, for the big files I faced the same performance issues as a lot of people:
> 1. Accessing some columns of a row-oriented file is quite inefficient.
> 2. Loading 10 GB of data into memory is slow, generally not necessary, and can exceed the RAM capacity of some machines.
>
> To address these issues, I came to a simple conclusion:
> 1. The memory should be column oriented.
> 2. The in-memory layout should be the same as the on-disk layout. I am very interested in memory mapping since it allows me to access files very quickly (there is no loading time) and to open files larger than memory.
>
> The solution I implemented is quite simple:
> 1. I compress the data inside an HDF5 dataset with vertical chunks (nrows x 1) using the Blosc compressor (not Blosc2). HDF5 is a great container for data that lets the user chunk data with whatever shape they want; vertical chunks allow each column to be decompressed individually without decompressing the others. Inside the file, the column names are stored in the user-defined metadata of the dataset.
> 2. With h5py I just open the HDF5 file and manipulate the h5py dataset object without reading its content. This way, I am doing a "memory map" of a compressed file (or a "lazy" access, I guess).
> When I access a column, h5py actually reads and decompresses the data on the fly, but this is totally transparent to me. It is not a zero-copy mechanism, but I can access the data while copying only the data I am interested in.
>
> The main goal of this "solution" is to reduce the time it takes a user to open a file and to greatly reduce RAM usage.
>
> In order to access the columns by their names, I made a small Python library with a class that redefines the access operators. It is not a very handy library, and right now I am considering turning this class into a pandas ExtensionArray. I am not sure, but I think it would allow me to use the pandas dataframe class on top of an h5py dataset instead of a numpy array.
>
> I am also considering using Apache Arrow instead. That is why I have recently been busy reading the Arrow documentation, the format specification and some blog articles. I must say that this library seems wonderful; I particularly love the fact that it tackles the problem of copying data and that it is available in several languages. The zero-copy policy is exactly what I am looking for! I also like the general format allowing columns of different types, nested columns and metadata for each column. HDF5 does not allow all of this.
> The documentation is quite heavy and I cannot say I understand everything. So I tried it!
> Actually, I compared Arrow with my home-made solution in my use case (so not a very fair benchmark, I agree on that). With several libraries/formats, this benchmark measures time and memory usage while it:
> 1. creates a table (100000 x 5000)
> 2. writes it on disk
> 3. opens the file
> 4. computes a sum and a product and stores the result
>
> You must be careful with the memory usage numbers I reported. For pyarrow I used the Arrow memory pool information and for the rest I used tracemalloc, which may not be very accurate. The memory usage just tells me whether the entire dataset is loaded or not.
>
> My questions are coming :) !
> At first I was wondering how memory mapping works when converting to a pandas dataframe. According to the benchmark, the function to_pandas is loading all the data into memory. Do you agree or did I miss something?
> When you open an Arrow IPC file with memory mapping and add a column, does it write the column on disk?
> When opening a compressed Arrow IPC file, what does memory mapping mean? What is the difference from opening the same file without memory mapping?
>
> Have you considered implementing a "lazy reading" of compressed data? Would it be relevant for the Arrow project?
> I read the format specification ( https://github.com/apache/arrow/blob/main/format/Message.fbs ) and I think only the data can be compressed, not the metadata. Am I wrong?
> I also found the CompressedInputStream and the CompressedOutputStream. Are these lower-level objects compared to write_feather? Does write_feather use these objects?
>
> Do you think Arrow could be a solution for my use case?
>
> I simplified my benchmark and the source code is in attachment. Do you see it?
>
> Some remarks:
> - At first, I tried PyTables but I faced too many issues.
>
> - I really like HDF5 because I can store several datasets (tables) and organize them. For example, my simulation gives me binary data and a log file (text), so inside my HDF5 file I gather everything linked to this simulation run: the source files, the binary data and the log file.
> If I store the log and the binary data separately, I may not be able to make the connection between them later. I also like HDF5 for all the compressors available, especially the very interesting Blosc compressor, which is, I think, doing a job very complementary to what Arrow is doing.
>
> - For the benchmarks, the files were stored on my hard drive. I tried storing them on my SSD and operations with the "memory-mapped" HDF5 were approximately 10x faster.
>
> If something is not clear or if you want more details, please tell me.
>
> Best regards,
>
> Fred
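As an aside, for anyone following along, here is a rough sketch of my understanding of the vertical-chunk pattern described above. It assumes the hdf5plugin package is installed to provide the Blosc filter, and the file name, dataset name, column names and sizes are just placeholders (much smaller than the 100000 x 5000 benchmark):

    import h5py
    import hdf5plugin  # registers the Blosc filter with HDF5
    import numpy as np

    nrows, ncols = 100_000, 50
    data = np.random.rand(nrows, ncols)
    columns = [f"col_{i}" for i in range(ncols)]

    # Write: one chunk per column, so each column can be decompressed alone.
    with h5py.File("table.h5", "w") as f:
        dset = f.create_dataset("table", data=data, chunks=(nrows, 1),
                                **hdf5plugin.Blosc())
        dset.attrs["columns"] = columns

    # Read: opening the file reads no data; slicing one column decompresses
    # only that column's chunk.
    with h5py.File("table.h5", "r") as f:
        dset = f["table"]
        names = [n.decode() if isinstance(n, bytes) else n
                 for n in dset.attrs["columns"]]
        col = dset[:, names.index("col_3")]

This is not Arrow, of course, but it mirrors what the h5py dataset object gives you: a lazy handle that only decompresses the chunks you actually touch.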
