Hello everyone,

For several years I have been working with HDF5 files to store/load data and with pandas as the in-memory representation to analyze it. The data can be of variable size (from a few MB to 10 GB). I use the dataframes inside interactive tools (with a GUI, where data access is quite random) and non-interactive tools (scripts); everything is in Python, but the files could be opened in other languages. The typical use case is to read only some columns of a file, do some operations on them and plot the result. Since the files are quite big, data compression is quite important for me to save disk space; writing duration, however, is not very important. Of course, for the big files I faced the same performance issues as a lot of people:
1. Accessing only some columns of a row-oriented file is quite inefficient.
2. Loading 10 GB of data into memory takes a long time, is generally not necessary, and the data can be larger than the RAM capacity of some machines.

In order to address these issues, I came to a simple conclusion:
1. The memory layout should be column-oriented.
2. The in-memory layout should be the same as the on-disk layout. I am very interested in memory mapping since it allows me to access files very quickly (there is no loading time) and to open files larger than memory.

The solution I implemented is quite simple (a rough code sketch is shown after the list):
1. I compress the data inside an HDF5 dataset with vertical chunks (nrows x 1) using the Blosc compressor (not Blosc2). HDF5 is a great container for data that lets the user chunk a dataset with whatever shape they want, and vertical chunks allow each column to be decompressed individually without touching the others. Inside the file, the column names are stored in the user-defined metadata of the dataset.
2. With h5py I just open the HDF5 file and manipulate the h5py dataset object without reading its content. This way, I am doing a "memory map" of a compressed file (or a "lazy" access, I guess). When I access a column, h5py actually reads and decompresses the data on the fly, but this is totally transparent for me. It is not a zero-copy mechanism, but I only copy the data I am interested in.
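
For reference, a minimal sketch of this approach, assuming the hdf5plugin package to register the Blosc filter for h5py (file name, dataset name and sizes are only illustrative):

    # Column-wise (nrows x 1) chunks compressed with Blosc, accessed lazily via h5py.
    import h5py
    import hdf5plugin  # registers the Blosc filter
    import numpy as np

    nrows, ncols = 100_000, 50
    data = np.random.rand(nrows, ncols)

    # Write: one chunk per column so each column can be decompressed independently.
    with h5py.File("table.h5", "w") as f:
        dset = f.create_dataset(
            "table",
            data=data,
            chunks=(nrows, 1),
            **hdf5plugin.Blosc(cname="lz4", clevel=5,
                               shuffle=hdf5plugin.Blosc.SHUFFLE),
        )
        dset.attrs["columns"] = [f"col_{i}" for i in range(ncols)]

    # Read: nothing is loaded until a column is actually sliced.
    f = h5py.File("table.h5", "r")
    dset = f["table"]
    names = list(dset.attrs["columns"])
    col = dset[:, names.index("col_3")]  # only this column's chunks are read and decompressed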

The main goal of this "solution" is to reduce the time it takes for a user to open a file and to greatly reduce RAM usage.

In order to access the columns by their names, I made a small Python library with a class that redefines the access operators. It is not a very handy library, and right now I am considering transforming this class into a pandas ExtensionArray. I am not sure, but I think it would allow me to use the pandas DataFrame class on top of an h5py dataset instead of a NumPy array.
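
To give an idea of what this class does, here is a simplified, hypothetical sketch (not the actual library): it is essentially a name-to-column mapping over the open h5py dataset.

    # Hypothetical, simplified version of the wrapper class described above.
    import h5py

    class LazyColumns:
        def __init__(self, path, dataset="table"):
            self._file = h5py.File(path, "r")
            self._dset = self._file[dataset]
            self._names = list(self._dset.attrs["columns"])

        def __getitem__(self, name):
            # Only the chunks of the requested column are read and decompressed.
            return self._dset[:, self._names.index(name)]

        def close(self):
            self._file.close()

    # cols = LazyColumns("table.h5")
    # result = cols["col_3"].sum() * cols["col_4"].mean()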

I am also considering using Apache Arrow instead. That is why I have recently been busy reading the Arrow documentation, the format specification and some blog articles. I must say that this library seems wonderful; I particularly love the fact that it tackles the problem of copying data and that it is available in several languages. The zero-copy policy is exactly what I am looking for! I also like the general format allowing columns of different types, nested columns and metadata for each column. HDF5 does not allow all of this.
The documentation is quite dense and I cannot say I understand everything.
So I tried it!
Actually, I compared Arrow with my home-made solution in my use case (so not a very fair benchmark, I agree on that). For several libraries/formats, this benchmark measures time and memory usage while it:
1. creates a table (100,000 x 5,000)
2. writes it to disk
3. opens the file
4. computes a sum and a product and stores the result (the pyarrow part is sketched below)
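
Roughly, the pyarrow part of the benchmark looks like this (a simplified sketch with illustrative names and compression settings; the attached script is the reference):

    # Simplified sketch of the pyarrow steps of the benchmark.
    import numpy as np
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.feather as feather

    nrows, ncols = 100_000, 5_000

    # 1. create a table
    table = pa.table({f"c{i}": np.random.rand(nrows) for i in range(ncols)})

    # 2. write it on disk (Feather V2, i.e. the Arrow IPC file format)
    feather.write_feather(table, "bench.feather", compression="lz4")

    # 3. open the file with memory mapping
    with pa.memory_map("bench.feather", "r") as source:
        mapped = pa.ipc.open_file(source).read_all()

        # 4. compute a sum and a product and store the result
        result = (pc.sum(mapped["c0"]).as_py(), pc.product(mapped["c1"]).as_py())

    print(result, pa.total_allocated_bytes())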

Be careful with the memory usage figures I report. For pyarrow I used the Arrow memory pool information, and for the rest I used tracemalloc, which may not be very accurate. The memory usage just tells me whether the entire dataset is loaded or not.

My questions are coming :)
First, I was wondering how memory mapping behaves when the table is converted to a pandas DataFrame. According to the benchmark, the function to_pandas loads all the data into memory.
Do you agree, or did I miss something?
When you open an Arrow IPC file with memory mapping and add a column, does it write the column to disk? When opening a compressed Arrow IPC file, what does memory mapping mean? What is the difference with opening the same file without memory mapping?
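
For context, here is the kind of check behind that observation (a simplified sketch reusing the file from the benchmark sketch above; the counters are only rough indicators of where the data ends up):

    # Compare the Arrow memory pool before/after converting to pandas.
    import pyarrow as pa

    with pa.memory_map("bench.feather", "r") as source:
        table = pa.ipc.open_file(source).read_all()
        print("Arrow pool after read_all: ", pa.total_allocated_bytes())
        df = table.to_pandas()
        print("Arrow pool after to_pandas:", pa.total_allocated_bytes())
        print("pandas DataFrame size:     ", df.memory_usage(deep=True).sum())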

Have you considered implementing a "lazy reading" of compressed data?
Would it be relevant for the Arrow project?
I read the format specification (https://github.com/apache/arrow/blob/main/format/Message.fbs) and I think only the data can be compressed, not the metadata. Am I wrong? I also found the CompressedInputStream and the CompressedOutputStream. Are these lower-level objects compared to write_feather? Does write_feather use them?
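
For clarity, this is how I understand CompressedInputStream / CompressedOutputStream can be used (just a sketch of my understanding, not a claim about what write_feather does internally). As far as I can tell, they compress the whole stream, metadata included, unlike the per-buffer compression of the IPC format:

    # Wrapping a raw file with CompressedOutputStream before writing an IPC stream.
    import pyarrow as pa

    table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    with pa.OSFile("data.arrows.gz", "wb") as raw:
        with pa.CompressedOutputStream(raw, "gzip") as sink:
            with pa.ipc.new_stream(sink, table.schema) as writer:
                writer.write_table(table)

    # Reading back through a CompressedInputStream.
    with pa.OSFile("data.arrows.gz", "rb") as raw:
        with pa.CompressedInputStream(raw, "gzip") as source:
            table2 = pa.ipc.open_stream(source).read_all()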

Do you think Arrow could be a solution for my use case?

I simplified my benchmark and the source code is attached. Can you see it?

Some remarks:
- At first, I tried PyTables but I faced too many issues.

- I really like HDF5 because I can store several datasets (tables) and organize them. For example, my simulation gives me binary data and a log file (text), so inside my HDF5 file I gather everything linked to this simulation run: the source files, the binary data and the log file. If I stored the log and the binary data separately, I might not be able to make the connection between them later. I also like HDF5 for all the available compressors, especially the very interesting Blosc compressor, which is, I think, doing a job very complementary to what Arrow does.
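
As an illustration of the kind of organization I mean (a hypothetical layout, names are made up):

    # Everything belonging to one simulation run gathered in one HDF5 file.
    import h5py
    import numpy as np

    with h5py.File("run_0001.h5", "w") as f:
        run = f.create_group("run_0001")
        run.create_dataset("results", data=np.random.rand(1000, 8))       # binary data
        run.create_dataset("log", data="step 1 ok\nstep 2 ok\n")          # text log file
        run.create_dataset("sources/main_input", data="param_a = 1.0\n")  # source file
        run.attrs["date"] = "2023-01-01"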

- For the benchmarks, the files were stored on my hard drive. I tried storing them on my SSD and operations with the "memory-mapped" HDF5 were approximately 10x faster.

If something is not clear or if you want more details, please tell me.

Best regards,

Fred

<<attachment: bench_arrow_simplified.zip>>
