I'm a little bit confused by the benchmark. The benchmark is labeled "open file", and yet "read_table" will read the entire file into memory. I don't think your other benchmarks are doing this (i.e. they are not reading the data into memory).
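For what it's worth, here is a minimal sketch of what I mean by "reading into memory" versus just opening the file. The file name "data.arrow" and the column name "col_0" are only placeholders, and it assumes an uncompressed Arrow IPC (Feather v2) file:

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.ipc as ipc

    # Eager: read_table materializes the whole table, to_pandas then converts it.
    df = feather.read_table("data.arrow").to_pandas()

    # Column projection: only the requested column is read off disk.
    one_col = feather.read_table("data.arrow", columns=["col_0"])

    # Memory mapped: opening the file only reads the metadata; for an
    # uncompressed file the table's buffers reference the mapping, so pages
    # are only faulted in when a column is actually touched.
    source = pa.memory_map("data.arrow", "r")
    table = ipc.open_file(source).read_all()
    col = table.column("col_0")

In either case, to_pandas will still copy everything into pandas memory, which is what your benchmark is measuring.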
As for the questions on memory mapping, I have a few answers below, but I will give a general answer here. Memory-mapped I/O will, at best, save you one memcpy of the data from kernel space to user space. Memory mapping is not the same as a "lazy dataframe". If you ask Arrow to read a file, it will always load that file off of the disk and into memory; this is true whether you use memory-mapped I/O or not. If you ask it to load a single column, then it will not load the entire file but instead load just that column. There are many other libraries that add "lazy dataframe" capabilities on top of Arrow files. What is it that you would like to achieve with Arrow?

> According to the benchmark, the function to_pandas is loading all the data into memory. Do you agree or did I miss something?

Yes. to_pandas will load the entire file into memory.

> When you open an Arrow IPC file with memory mapping and add a column, does it write the column on disk?

If you open any existing file with memory mapping, it is generally assumed it will be read-only. In theory, you could memory map a larger space and then write into it over time, but none of the core Arrow utilities are going to do anything like that.

> When opening a compressed Arrow IPC file, what does memory mapping mean? What is the difference from opening the same file without memory mapping?

It means that you will be able to avoid a memcpy of the compressed bytes from kernel space to user space.

On Sun, May 21, 2023 at 10:32 AM Frédéric MASSON <[email protected]> wrote:

> Hello everyone,
>
> For several years I have been working with HDF5 files to store/load information and pandas as the in-memory representation to analyze them. The data can be of variable size (from a few MB to 10 GB). I use the dataframes inside interactive tools (with a GUI, where data access is quite random) and non-interactive tools (scripts); everything is in Python, but the files could be opened in other languages. The typical use case is to get only some columns of the file, do some operations on them and plot the result. Since the files are quite big, data compression is quite important for me to save disk space. Writing duration, however, is not very important.
> Of course, for the big files I faced the same performance issues as a lot of people:
> 1. Accessing some columns of a row-oriented file is quite inefficient.
> 2. Loading 10 GB of data into memory is slow, generally not necessary, and can exceed the RAM capacity of some machines.
>
> To address these issues, I came to a simple conclusion:
> 1. The memory should be column oriented.
> 2. The in-memory layout should be the same as the on-disk layout. I am very interested in memory mapping since it allows me to access files very quickly (there is no loading time) and to open files larger than memory.
>
> The solution I implemented is quite simple:
> 1. I compress the data inside an HDF5 dataset with vertical chunks (nrows x 1) using the Blosc compressor (not Blosc2). HDF5 is a great container for data that lets the user chunk data with whatever shape they want; vertical chunks allow each column to be decompressed individually without decompressing the others. Inside the file, the column names are stored in the user-defined metadata of the dataset.
> 2. With h5py I just open the HDF5 file and manipulate the h5py dataset object without reading its content. This way, I am doing a "memory map" of a compressed file (or a "lazy" access, I guess).
> When I access a column, h5py actually reads and decompresses the data on the fly, but this is totally transparent to me. It is not a zero-copy mechanism, but I can access the data while copying only the data I am interested in.
>
> The main goal of this "solution" is to reduce the time it takes a user to open a file and to greatly reduce RAM usage.
>
> In order to access the columns by their names, I made a small Python library with a class that redefines the access operators. It is not a very handy library, and right now I am considering turning this class into a pandas ExtensionArray. I am not sure, but I think it would allow me to use the pandas dataframe class on top of an h5py dataset instead of a numpy array.
>
> I am also considering using Apache Arrow instead. That is why I have recently been busy reading the Arrow documentation, the format specification and some blog articles. I must say that this library seems wonderful; I particularly love the fact that it tackles the problem of copying data and that it is available in several languages. The zero-copy policy is exactly what I am looking for! I also like the general format allowing columns of different types, nested columns and metadata for each column. HDF5 does not allow all of this.
> The documentation is quite heavy and I cannot say I understand everything. So I tried it!
> Actually, I compared Arrow with my home-made solution in my use case (so not a very fair benchmark, I agree on that). With several libraries/formats, this benchmark measures time and memory usage while it:
> 1. creates a table (100000 x 5000)
> 2. writes it on disk
> 3. opens the file
> 4. computes a sum and a product and stores the result
>
> You must be careful with the memory usage numbers I reported. For pyarrow I used the Arrow memory pool information and for the rest I used tracemalloc, which may not be very accurate. The memory usage just tells me whether the entire dataset is loaded or not.
>
> My questions are coming :) !
> At first I was wondering how memory mapping works when converting to a pandas dataframe. According to the benchmark, the function to_pandas is loading all the data into memory. Do you agree or did I miss something?
> When you open an Arrow IPC file with memory mapping and add a column, does it write the column on disk?
> When opening a compressed Arrow IPC file, what does memory mapping mean? What is the difference from opening the same file without memory mapping?
>
> Have you considered implementing a "lazy reading" of compressed data? Would it be relevant for the Arrow project?
> I read the format specification ( https://github.com/apache/arrow/blob/main/format/Message.fbs ) and I think only the data can be compressed, not the metadata. Am I wrong?
> I also found the CompressedInputStream and the CompressedOutputStream. Are these lower-level objects compared to write_feather? Does write_feather use these objects?
>
> Do you think Arrow could be a solution for my use case?
>
> I simplified my benchmark and the source code is in attachment. Do you see it?
>
> Some remarks:
> - At first, I tried PyTables but I faced too many issues.
>
> - I really like HDF5 because I can store several datasets (tables) and organize them. For example, my simulation gives me binary data and a log file (text), so inside my HDF5 file I gather everything linked to this simulation run: the source files, the binary data and the log file.
> If I store the log and the binary data separately, I may not be able to make the connection between them later. I also like HDF5 for all the compressors available, especially the very interesting Blosc compressor, which is, I think, doing a job very complementary to what Arrow is doing.
>
> - For the benchmarks, the files were stored on my hard drive. I tried storing them on my SSD and operations with the "memory-mapped" HDF5 were approximately 10x faster.
>
> If something is not clear or if you want more details, please tell me.
>
> Best regards,
>
> Fred
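As an aside, for anyone following along, here is a rough sketch of my understanding of the vertical-chunk pattern described above. It assumes the hdf5plugin package is installed to provide the Blosc filter, and the file name, dataset name, column names and sizes are just placeholders (much smaller than the 100000 x 5000 benchmark):

    import h5py
    import hdf5plugin  # registers the Blosc filter with HDF5
    import numpy as np

    nrows, ncols = 100_000, 50
    data = np.random.rand(nrows, ncols)
    columns = [f"col_{i}" for i in range(ncols)]

    # Write: one chunk per column, so each column can be decompressed alone.
    with h5py.File("table.h5", "w") as f:
        dset = f.create_dataset("table", data=data, chunks=(nrows, 1),
                                **hdf5plugin.Blosc())
        dset.attrs["columns"] = columns

    # Read: opening the file reads no data; slicing one column decompresses
    # only that column's chunk.
    with h5py.File("table.h5", "r") as f:
        dset = f["table"]
        names = [n.decode() if isinstance(n, bytes) else n
                 for n in dset.attrs["columns"]]
        col = dset[:, names.index("col_3")]

This is not Arrow, of course, but it mirrors what the h5py dataset object gives you: a lazy handle that only decompresses the chunks you actually touch.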
