Weston Pace created ARROW-12519:
-----------------------------------
Summary: [C++] Create/document better characterization of
jemalloc/mimalloc
Key: ARROW-12519
URL: https://issues.apache.org/jira/browse/ARROW-12519
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
The following script reads a large dataset 10 times in a loop. The dataset
in question comes from the Ursa benchmarks
([https://github.com/ursacomputing/benchmarks]), but any sufficiently large
dataset should work. This one is ~5-6 GB when deserialized into an Arrow
table. The conversion to a dataframe is not zero-copy, so each pass through
the loop requires about 8.6 GB.
Running this loop 10 times with mimalloc consumes 27 GB of RAM, and the result
is quite deterministic: even a 1 second sleep between iterations yields the
same number. On the other hand, if I move the read into its own method (second
version below), the script uses only 14 GB.
Our current rule of thumb seems to be "as long as the allocator stabilizes at
some number at some point, it is not a bug", so technically both 27 GB and
14 GB are valid.
If we can't put any bound whatsoever on the RAM that Arrow needs, it will
eventually become a problem for adoption. I think we need to develop some sort
of characterization of how much mimalloc/jemalloc should be allowed to
over-allocate before we consider it a bug and require changing the code to
avoid the situation (or documenting that certain operations are not valid).
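As a starting point for such a characterization, here is a minimal sketch (my own, not an existing Arrow API) that quantifies over-allocation by comparing process RSS against what Arrow believes it has allocated, reusing the same `psutil`/`pyarrow` calls as the script below. The name `overhead_ratio` is illustrative.

{code:python}
import os

import psutil
import pyarrow as pa


def overhead_ratio():
    """Return process RSS divided by Arrow's tracked allocation."""
    rss = psutil.Process(os.getpid()).memory_info().rss
    allocated = pa.total_allocated_bytes()
    return rss / allocated if allocated else float("inf")


pa.set_memory_pool(pa.mimalloc_memory_pool())
buf = pa.allocate_buffer(512 * 1024 * 1024)  # ~512 MB through mimalloc
print(f"overhead: {overhead_ratio():.2f}x")
{code}

A threshold on this kind of ratio (RSS should stay within some factor of tracked allocations after the workload stabilizes) could be one concrete form the characterization takes.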
----CODE----
// First version (uses ~27GB)
{code:python}
import time
import pyarrow as pa
import pyarrow.parquet as pq
import psutil
import os
pa.set_memory_pool(pa.mimalloc_memory_pool())
print(pa.default_memory_pool().backend_name)
for _ in range(10):
    table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
    df = table.to_pandas()
    print(pa.total_allocated_bytes())
    proc = psutil.Process(os.getpid())
    print(proc.memory_info())
{code}
// Second version (uses ~14GB)
{code:python}
import time
import pyarrow as pa
import pyarrow.parquet as pq
import psutil
import os
pa.set_memory_pool(pa.mimalloc_memory_pool())
print(pa.default_memory_pool().backend_name)
def bm():
    table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
    df = table.to_pandas()
    print(pa.total_allocated_bytes())
    proc = psutil.Process(os.getpid())
    print(proc.memory_info())

for _ in range(10):
    bm()
{code}
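One plausible factor in the loop/function gap (my own sketch, not confirmed by measurement here): in the first version, `table` and `df` are still alive while the next `read_table` allocates, because rebinding a name frees the old object only after the new one is built, so two copies briefly coexist; in the second version the locals die when `bm()` returns. This toy reproduction drops the reference explicitly before reallocating:

{code:python}
import pyarrow as pa

pa.set_memory_pool(pa.mimalloc_memory_pool())

for _ in range(3):
    buf = pa.allocate_buffer(64 * 1024 * 1024)
    # Without this del, `buf` would still be alive while the next
    # allocate_buffer() runs, just as `table`/`df` are in version one.
    del buf

# Every buffer was released back to the pool before each reallocation.
print(pa.total_allocated_bytes())
{code}

Whether the allocator actually returns the doubled peak to the OS afterwards is exactly the behavior this issue asks us to characterize.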
--
This message was sent by Atlassian Jira
(v8.3.4#803005)