[GitHub] [arrow] pitrou commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

GitBox Wed, 12 May 2021 01:39:17 -0700


pitrou commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r630837422




##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python

Review comment:
       I know we've started using those `ipython` blocks, but I'm really not fond of them. They seem to make building the docs considerably slower (especially if the workload is non-trivial).
   
   @jorisvandenbossche What do you think?

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory

Review comment:
       This is rather misleading. The data is loaded back into memory when it is being read. It's just that it's read lazily, so the cost is not paid up front (and no cost is paid for data that is never accessed).
   
   What memory mapping can avoid is an intermediate copy when reading the data. So it is more performant in that sense.

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the memory mapped data without the need to make copies of it

Review comment:
       This seems a bit misleading again. First, I don't understand the point of writing back data that's read from a memory-mapped file (just copy the file if that's what you want to do?). Second, the fact that writing data doesn't consume additional memory has nothing to do with the fact that the data is memory-mapped, AFAICT.

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format

Review comment:
       Always "Arrow" capitalized.

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays

Review comment:
       It seems this should go into `ipc.rst`. `memory.rst` is about the low-level memory and IO APIs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
