Re: [Arrow IPC] memory mapping of compressed file / lazy reading

Weston Pace Fri, 26 May 2023 05:19:07 -0700

Thanks for the clarification, I understand your use case better now.  You
are right that memory mapping can be used in this way you describe.


> why does it decompresses the data here ? For me it is doing a unnecessary
copy
> by transforming a compressed record batch into a uncompressed record
batch.
> Could not a table (or a record batch) be composed of array with
compressed buffer ?

I think what you are describing is possible with the Arrow format.  You
could wait to do decompression until the buffer is first accessed.
However, as you have noticed, it is not the way the Arrow-C++ library is
currently implemented.

You can achieve something similar by not loading the column until you
actually need it.  First, load an empty table (0 columns) from the Arrow
file (so you can get the schema and length) and create a lazy dataframe.
Then, when the column is accessed, you can create another Arrow reader and
read just that column.

So I suspect this is simply a design decision.  It should be possible to
build what you are describing but it is not how things are currently
implemented.  One caution I would have with memory mapping in this way is
that the disk load becomes something of a hidden cost that is harder to
reason about.  It's also more difficult to manage how much memory you are
currently using because it depends on the user's access patterns.  Maybe
deleting an array frees up ram and maybe it doesn't?

On Thu, May 25, 2023 at 2:18 PM Frédéric MASSON <masson-frede...@hotmail.fr>
wrote:

> Hi,
>
> Thank you very much for your answer, I am sorry if some sentences are
> confusing.
> I did not know about the kernel space/user space and that memory mapping
> I/O was more general than just file memory mapping. I got a better
> understanding now.
> So I looked a bit deeper inside memory mapping (
> https://en.wikipedia.org/wiki/Mmap and
> https://www.ibm.com/docs/en/aix/7.2?topic=memory-understanding-mapping).
>
> I agree the term "lazy" can be ambiguous. I was talking about "lazy
> reading" (which is memory mapping a file for me) and not "lazy computing."
>
> > What is it that you would like to achieve with Arrow?
>
> The ideal use case would be having a Pandas DataFrame object with data
> directly on the file and not in memory : a DataFrame but with a backend
> that only actually read the necessary data in the file and only when needed.
> I have found that memory mapping a file is a very efficient way to open a
> file and to access its data partially. For me it is an essential part of a
> the "zero-copy" strategy. Especially with the last-gen SSD.
> Since Arrow allows to share memory between libraries without copy I
> thought I would be able to have a pandas dataFrame without actually loading
> the data.
>
> And I actually found a way to have a Pandas dataframe with all the data
> memory mapped from the disk with the "types_mapper=pd.ArrowDtype" argument
> :
>
>     with pa.memory_map('table.arrow', 'rb') as source:
>         df =
> pa.ipc.open_file(source).read_all().to_pandas(types_mapper=pd.ArrowDtype)
>
>
> To go further I wanted to do the same with a compressed arrow file. But I
> did not success. When reading batch records, the Arrow library *always 
> *uncompress
> the buffers inside the batch record.
>
> I looked a bit into the Arrow sources files on github and I think it comes
> from the function "LoadRecordBatchSubset" (
> https://github.com/apache/arrow/blob/2d32efeedad88743dd635ff562c65e072cfb44f7/cpp/src/arrow/ipc/reader.cc#L522
> ).
> This function just "move" all the columns unless their data are compressed
> in which case it decompresses the data and thus loads it into memory (with
> DecompressBuffers function).
>
> So my question is  : why does it decompresses the data here ? For me it is
> doing a unnecessary copy by transforming a compressed record batch into a
> uncompressed record batch.
> Could not a table (or a record batch) be composed of array with compressed
> buffer ?
>
> Thank you again,
>
> Fred
>
>
> On 22/05/2023 at 22:00, Weston Pace wrote:
>
> Well, I suppose there are cases where you can map a file with memory
> mapped I/O and then, if you are careful not to touch those buffers, they
> might not be loaded into memory.  However, that is a very difficult thing
> to achieve.  For example, when reading a file we need to access the
> metadata that is scattered throughout the file.  This will trigger that
> data to be loaded.  The OS will then also typically load some amount of
> memory ahead of the data you requested. Also, it's very easy to trigger
> some kind of scan through the data (e.g. to count how many nulls there are)
> which might cause that data to be loaded.
>
> But I was inaccurate in my earlier statement, it is possible to use memory
> mapped I/O alone to achieve some kinds of lazy loading.  I suppose that is
> why read_table gets faster in your benchmark (I missed that).  It will
> still need to read some data (all of the metadata for example) from disk.
> I guess I am a little surprised by 4.6s but we could dig into that.
>
> Also, compression will force the data to be loaded because of the way
> read_table works.
>
> I think most current users achieve lazy loading by selectively loading the
> data they need and not by loading the entire table with memory mapping and
> avoiding access to the data they don't need.
>
> On Mon, May 22, 2023 at 12:51 PM Weston Pace <weston.p...@gmail.com>
> wrote:
>
>> I'm a little bit confused on the benchmark.  The benchmark is labeled
>> "open file" and yet "read_table" will read the entire file into memory.  I
>> don't think your other benchmarks are doing this (e.g. they are not reading
>> data into memory).
>>
>> As for the questions on memory mapping, I have a few answers below, but I
>> will give a general answer here.  Memory mapped I/O will, at best, save you
>> from one memcpy of the data from kernel space to user space.  Memory
>> mapping is not the same as a "lazy dataframe".  If you ask Arrow to read a
>> file then it will always load that file off of the disk and into memory.
>> This is true if you used memory mapped I/O or not.  If you ask it to load a
>> single column, then it will not load the entire file, but instead load a
>> single column.  There are many other libraries that add "lazy dataframe"
>> capabilities on top of Arrow files.
>>
>> What is it that you would like to achieve with Arrow?
>>
>> > According to the benchmark, the fonction to_pandas is loading all the
>> data into memory.
>> Do you agree or did I miss something ?
>>
>> Yes.  to_pandas will load the entire file into memory.
>>
>> > When you open an Arrow IPC file with memory mapping and add a column,
>> does it write the column on disk ?
>>
>> If you open any existing file with memory mapping it's generally assumed
>> it will be read only.  In theory, you could memory map a larger space, and
>> then write into it over time, but none of the core Arrow utilities are
>> going to do anything like that.
>>
>> > When opening a compressed Arrow IPC file, what does memory mapping
>> means ? What is the difference with opening the same file without memory
>> mapping ?
>>
>> This means that you will be able to avoid a memcpy of the compressed
>> bytes from kernel space to user space.
>>
>> On Sun, May 21, 2023 at 10:32 AM Frédéric MASSON <
>> masson-frede...@hotmail.fr> wrote:
>>
>>> Hello everyone,
>>>
>>> For several years I have been working with HDF5 files to store/load
>>> information and pandas as in-memory representation to analyze them.
>>> Globally the data can be of variable sizes (from a few MB to 10GB). I use
>>> the dataframes inside interactive tools (with a GUI, where the data access
>>> is quite random) and non-interactive tools (scripts), everything is in
>>> Python but the files could be opened in other languages. The typical use
>>> case is to get only some columns of the file, doing some operations on them
>>> and plot the result. Since the files are quite big, data compression is
>>> quite important for me to save disk-space. However writing duration is not
>>> very important.
>>> Of course, for the big files I faced the same performances issues as a
>>> lot of people :
>>> 1. Access some columns with a row oriented file is quite inefficient
>>> 2. loading 10GB of data into memory is long, generally not necessary and
>>> can be larger than RAM capacity on some machines.
>>>
>>> In order to face this issues, I came to a simple conclusion :
>>> 1. The memory should be column oriented
>>> 2. The in-memory layout should be the same as the on-disk memory. I am
>>> very interested in memory mapping since it allows me access files very
>>> quickly (there is no loading time) and open file larger than memory.
>>>
>>> The solution I implemented is quite simple
>>> 1. I compress the data inside a HDF5 dataset with vertical chunks (nrows
>>> x 1) with the Blosc compressor (not Blosc2). HDF5 is a great container for
>>> data, that allow to chunk data with the shape the user want. Vertical chunk
>>> allows to decompress each column individually without decompressing the
>>> others. Inside the file, the columns names are stored inside the
>>> user-defined metadata of the dataset.
>>> 2. With h5py I just open the HDF5 file and manipulate the h5py dataset
>>> object without reading its content. This way, I am doing a "memory-map" of
>>> a compressed file (or a "lazy" access I guess). When I access to a column,
>>> then the h5py actually reads and decompress the data on-the-fly but is
>>> totally transparent for me. This is not a zero-copy mechanism but I can
>>> access the data copying only the interesting data.
>>>
>>> The main goal with this "solution" is to reduce the time when a user
>>> opens a file and to reduce a lot the RAM usage.
>>>
>>> In order to access the columns with their names I made a small python
>>> library with a class that redefines the access operators. It is not a very
>>> handy library and right now I am considering transforming this class into a
>>> Pandas ExtensionArray. I am not sure but I think it would allow me to use
>>> the pandas dataframe class on a h5py dataset instead of a numpy array.
>>>
>>> I am also considering using Apache Arrow instead. That is why, recently
>>> I have been busy reading the Arrow documentation, the format specification
>>> and some blog articles. I must say that this library seems wonderful, I
>>> particularly love the fact that it tackle the problem of copying data and
>>> it is available in several languages. The zero-copy policy is exactly what
>>> I am looking for ! I also like the general format allowing to have columns
>>> of different types, nested columns and metadata for each columns. HDF5 does
>>> not allow to do all this.
>>> The documentation is quite heavy and I cannot say I understand
>>> everything.
>>> So I tried it !
>>> Actually I compared Arrow with my home-made solution in my use case (so
>>> not a very fair benchmark, I agree on that). With several lib/formats, this
>>> benchmark measures time and memory usage while it
>>> 1. creates a table (100000*5000)
>>> 2. writes it on disk
>>> 3. opens the file
>>> 4. computes a sum and a product a stores the result
>>>
>>>
>>>
>>> You must be careful with the memory usage I wrote. For pyArrow I used
>>> the Arrow memory pool information and for the rest I used tracemalloc but
>>> it may not be very accurate. The memory usage just tells me if the entire
>>> dataset is loaded or not.
>>>
>>> My questions are coming :) !
>>> At first I was wondering how the memory mapping worked when converted to
>>> pandas dataframe. According to the benchmark, the fonction to_pandas is
>>> loading all the data into memory.
>>> Do you agree or did I miss something ?
>>> When you open an Arrow IPC file with memory mapping and add a column,
>>> does it write the column on disk ?
>>> When opening a compressed Arrow IPC file, what does memory mapping means
>>> ? What is the difference with opening the same file without memory mapping ?
>>>
>>> Have you considered implemented a "lazy-reading" of compressed data ?
>>> Would it be relevant for the Arrow project ?
>>> I read the format specification (
>>> https://github.com/apache/arrow/blob/main/format/Message.fbs ) and I
>>> think only the data can be compressed. Not the metadata, I am wrong ?
>>> I also found the CompressedInputStream and the CompressedOutputStream.
>>> Is it some low level object compared to the write_feather ? Does
>>> write_feather use these objects ?
>>>
>>> Do you think Arrow could be a solution for my use case ?
>>>
>>> I simplified my benchmark and the source code is in attachment. Do you
>>> see it ?
>>>
>>> Some remarks :
>>> - At first, I tried PyTables but I faced too many issues.
>>>
>>> - I really like HDF5 because I can store several datasets (Tables) and
>>> organize them. For example, my simulation is giving me binary data and a
>>> log file (text), so inside my HDF5 file I am gathering everything linked so
>>> this simulation run : the sources files, the binary data and the log file.
>>> If I store the log and the binary separately I may not be able to make the
>>> connection between them later. I also like HDF5 for all the compressors
>>> available, especially the very interesting blosc compressor that is, I
>>> think, doing a job very complementary to what arrow is doing.
>>>
>>> -For the benchmarks, the files were stored on my hard-drive. I tried
>>> storing them on my SSD and operations with the "memory mapped" HDF5 were
>>> approximately 10x faster.
>>>
>>> If something is not clear or if you want more details please tell me,
>>>
>>> Best regards,
>>>
>>> Fred
>>>
>>

Re: [Arrow IPC] memory mapping of compressed file / lazy reading

Reply via email to