Re: [Arrow IPC] memory mapping of compressed file / lazy reading

Frédéric MASSON Sat, 03 Jun 2023 11:43:19 -0700

I agree with you, the amount of memory used depends on the user behaviorbut that is the point : to have only what the user is using and not.

And I also agree, even with memory mapping, the disk load can still bepresent.For example, I noticed that with the code below the read_all call withgenerate an important reading of the disk (the whole file I suppose)even if nothing is loaded into RAM (and nothing seems written anywhere).This important reading happens only one time and if I start a new pythoninstance and execute the same code this reading does not occur.

You are right, only reading the columns the user accesses is probablythe best thing to do here, this way the memory allocated is perfectlycontrolled. I just thought that memory mapping will do thistransparently for me without being so unpredictable.I'll dig into you suggestion when I have more time. I also want to diginto vaex.


Thank you again for your answers, you helped me see things clearer here.


Le 26/05/2023 à 14:18, Weston Pace a écrit :

Thanks for the clarification, I understand your use case better now. You are right that memory mapping can be used in this way you describe.

> why does it decompresses the data here ? For me it is doing aunnecessary copy> by transforming a compressed record batch into a uncompressed recordbatch.> Could not a table (or a record batch) be composed of array withcompressed buffer ?

I think what you are describing is possible with the Arrow format. You could wait to do decompression until the buffer is firstaccessed. However, as you have noticed, it is not the way theArrow-C++ library is currently implemented.

You can achieve something similar by not loading the column until youactually need it. First, load an empty table (0 columns) from theArrow file (so you can get the schema and length) and create a lazydataframe. Then, when the column is accessed, you can create anotherArrow reader and read just that column.

So I suspect this is simply a design decision. It should be possibleto build what you are describing but it is not how things arecurrently implemented. One caution I would have with memory mappingin this way is that the disk load becomes something of a hidden costthat is harder to reason about. It's also more difficult to manage howmuch memory you are currently using because it depends on the user'saccess patterns. Maybe deleting an array frees up ram and maybe itdoesn't?

On Thu, May 25, 2023 at 2:18 PM Frédéric MASSON<[email protected]> wrote:

Hi,

Thank you very much for your answer, I am sorry if some sentences
are confusing.

I did not know about the kernel space/user space and that memory
mapping I/O was more general than just file memory mapping. I got
a better understanding now.
So I looked a bit deeper inside memory mapping (
https://en.wikipedia.org/wiki/Mmap and
https://www.ibm.com/docs/en/aix/7.2?topic=memory-understanding-mapping).

I agree the term "lazy" can be ambiguous. I was talking about
"lazy reading" (which is memory mapping a file for me) and not
"lazy computing."

> What is it that you would like to achieve with Arrow?

The ideal use case would be having a Pandas DataFrame object with
data directly on the file and not in memory : a DataFrame but with
a backend that only actually read the necessary data in the file
and only when needed.
I have found that memory mapping a file is a very efficient way to
open a file and to access its data partially. For me it is an
essential part of a the "zero-copy" strategy. Especially with the
last-gen SSD.
Since Arrow allows to share memory between libraries without copy
I thought I would be able to have a pandas dataFrame without
actually loading the data.

And I actually found a way to have a Pandas dataframe with all the
data memory mapped from the disk with the
"types_mapper=pd.ArrowDtype" argument :

with pa.memory_map('table.arrow', 'rb') as source:
df =
pa.ipc.open_file(source).read_all().to_pandas(types_mapper=pd.ArrowDtype)

To go further I wanted to do the same with a compressed arrow
file. But I did not success. When reading batch records, the Arrow
library *always *uncompress the buffers inside the batch record.

I looked a bit into the Arrow sources files on github and I think
it comes from the function "LoadRecordBatchSubset"

(https://github.com/apache/arrow/blob/2d32efeedad88743dd635ff562c65e072cfb44f7/cpp/src/arrow/ipc/reader.cc#L522
).
This function just "move" all the columns unless their data are
compressed in which case it decompresses the data and thus loads
it into memory (with DecompressBuffers function).

So my question is : why does it decompresses the data here ? For
me it is doing a unnecessary copy by transforming a compressed
record batch into a uncompressed record batch.
Could not a table (or a record batch) be composed of array with
compressed buffer ?

Thank you again,

Fred

On 22/05/2023 at 22:00, Weston Pace wrote:

    Well, I suppose there are cases where you can map a file with
    memory mapped I/O and then, if you are careful not to touch those
    buffers, they might not be loaded into memory.  However, that is
    a very difficult thing to achieve.  For example, when reading a
    file we need to access the metadata that is scattered throughout
    the file.  This will trigger that data to be loaded.  The OS will
    then also typically load some amount of memory ahead of the data
    you requested. Also, it's very easy to trigger some kind of scan
    through the data (e.g. to count how many nulls there are) which
    might cause that data to be loaded.

    But I was inaccurate in my earlier statement, it is possible to
    use memory mapped I/O alone to achieve some kinds of lazy
    loading.  I suppose that is why read_table gets faster in your
    benchmark (I missed that).  It will still need to read some data
    (all of the metadata for example) from disk.  I guess I am a
    little surprised by 4.6s but we could dig into that.

    Also, compression will force the data to be loaded because of the
    way read_table works.

    I think most current users achieve lazy loading by selectively
    loading the data they need and not by loading the entire table
    with memory mapping and avoiding access to the data they don't need.

    On Mon, May 22, 2023 at 12:51 PM Weston Pace
    <[email protected]> wrote:

        I'm a little bit confused on the benchmark. The benchmark is
        labeled "open file" and yet "read_table" will read the entire
        file into memory.  I don't think your other benchmarks are
        doing this (e.g. they are not reading data into memory).

        As for the questions on memory mapping, I have a few answers
        below, but I will give a general answer here.  Memory mapped
        I/O will, at best, save you from one memcpy of the data from
        kernel space to user space.  Memory mapping is not the same
        as a "lazy dataframe".  If you ask Arrow to read a file then
        it will always load that file off of the disk and into
        memory.  This is true if you used memory mapped I/O or not. 
        If you ask it to load a single column, then it will not load
        the entire file, but instead load a single column. There are
        many other libraries that add "lazy dataframe" capabilities
        on top of Arrow files.

        What is it that you would like to achieve with Arrow?

        > According to the benchmark, the fonction to_pandas is
        loading all the data into memory.
        Do you agree or did I miss something ?

        Yes.  to_pandas will load the entire file into memory.

        > When you open an Arrow IPC file with memory mapping and add
        a column, does it write the column on disk ?

        If you open any existing file with memory mapping it's
        generally assumed it will be read only.  In theory, you could
        memory map a larger space, and then write into it over time,
        but none of the core Arrow utilities are going to do anything
        like that.

        > When opening a compressed Arrow IPC file, what does memory
        mapping means ? What is the difference with opening the same
        file without memory mapping ?

        This means that you will be able to avoid a memcpy of the
        compressed bytes from kernel space to user space.

        On Sun, May 21, 2023 at 10:32 AM Frédéric MASSON
        <[email protected]> wrote:

            Hello everyone,

            For several years I have been working with HDF5 files to
            store/load information and pandas as in-memory
            representation to analyze them. Globally the data can be
            of variable sizes (from a few MB to 10GB). I use the
            dataframes inside interactive tools (with a GUI, where
            the data access is quite random) and non-interactive
            tools (scripts), everything is in Python but the files
            could be opened in other languages. The typical use case
            is to get only some columns of the file, doing some
            operations on them and plot the result. Since the files
            are quite big, data compression is quite important for me
            to save disk-space. However writing duration is not very
            important.
            Of course, for the big files I faced the same
            performances issues as a lot of people :
            1. Access some columns with a row oriented file is quite
            inefficient
            2. loading 10GB of data into memory is long, generally
            not necessary and can be larger than RAM capacity on some
            machines.

            In order to face this issues, I came to a simple conclusion :
            1. The memory should be column oriented
            2. The in-memory layout should be the same as the on-disk
            memory. I am very interested in memory mapping since it
            allows me access files very quickly (there is no loading
            time) and open file larger than memory.

            The solution I implemented is quite simple
            1. I compress the data inside a HDF5 dataset with
            vertical chunks (nrows x 1) with the Blosc compressor
            (not Blosc2). HDF5 is a great container for data, that
            allow to chunk data with the shape the user want.
            Vertical chunk allows to decompress each column
            individually without decompressing the others. Inside the
            file, the columns names are stored inside the
            user-defined metadata of the dataset.
            2. With h5py I just open the HDF5 file and manipulate the
            h5py dataset object without reading its content. This
            way, I am doing a "memory-map" of a compressed file (or a
            "lazy" access I guess). When I access to a column, then
            the h5py actually reads and decompress the data
            on-the-fly but is totally transparent for me. This is not
            a zero-copy mechanism but I can access the data copying
            only the interesting data.

            The main goal with this "solution" is to reduce the time
            when a user opens a file and to reduce a lot the RAM usage.

            In order to access the columns with their names I made a
            small python library with a class that redefines the
            access operators. It is not a very handy library and
            right now I am considering transforming this class into a
            Pandas ExtensionArray. I am not sure but I think it would
            allow me to use the pandas dataframe class on a h5py
            dataset instead of a numpy array.

            I am also considering using Apache Arrow instead. That is
            why, recently I have been busy reading the Arrow
            documentation, the format specification and some blog
            articles. I must say that this library seems wonderful, I
            particularly love the fact that it tackle the problem of
            copying data and it is available in several languages.
            The zero-copy policy is exactly what I am looking for ! I
            also like the general format allowing to have columns of
            different types, nested columns and metadata for each
            columns. HDF5 does not allow to do all this.
            The documentation is quite heavy and I cannot say I
            understand everything.
            So I tried it !
            Actually I compared Arrow with my home-made solution in
            my use case (so not a very fair benchmark, I agree on
            that). With several lib/formats, this benchmark measures
            time and memory usage while it
            1. creates a table (100000*5000)
            2. writes it on disk
            3. opens the file
            4. computes a sum and a product a stores the result



            You must be careful with the memory usage I wrote. For
            pyArrow I used the Arrow memory pool information and for
            the rest I used tracemalloc but it may not be very
            accurate. The memory usage just tells me if the entire
            dataset is loaded or not.

            My questions are coming :) !
            At first I was wondering how the memory mapping worked
            when converted to pandas dataframe. According to the
            benchmark, the fonction to_pandas is loading all the data
            into memory.
            Do you agree or did I miss something ?
            When you open an Arrow IPC file with memory mapping and
            add a column, does it write the column on disk ?
            When opening a compressed Arrow IPC file, what does
            memory mapping means ? What is the difference with
            opening the same file without memory mapping ?

            Have you considered implemented a "lazy-reading" of
            compressed data ?
            Would it be relevant for the Arrow project ?
            I read the format specification
            (https://github.com/apache/arrow/blob/main/format/Message.fbs
            ) and I think only the data can be compressed. Not the
            metadata, I am wrong ?
            I also found the CompressedInputStream and the
            CompressedOutputStream. Is it some low level object
            compared to the write_feather ? Does write_feather use
            these objects ?

            Do you think Arrow could be a solution for my use case ?

            I simplified my benchmark and the source code is in
            attachment. Do you see it ?

            Some remarks :
            - At first, I tried PyTables but I faced too many issues.

            - I really like HDF5 because I can store several datasets
            (Tables) and organize them. For example, my simulation is
            giving me binary data and a log file (text), so inside my
            HDF5 file I am gathering everything linked so this
            simulation run : the sources files, the binary data and
            the log file. If I store the log and the binary
            separately I may not be able to make the connection
            between them later. I also like HDF5 for all the
            compressors available, especially the very interesting
            blosc compressor that is, I think, doing a job very
            complementary to what arrow is doing.

            -For the benchmarks, the files were stored on my
            hard-drive. I tried storing them on my SSD and operations
            with the "memory mapped" HDF5 were approximately 10x faster.

            If something is not clear or if you want more details
            please tell me,

            Best regards,

            Fred

Re: [Arrow IPC] memory mapping of compressed file / lazy reading

Reply via email to