[
https://issues.apache.org/jira/browse/ARROW-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441916#comment-17441916
]
Weston Pace commented on ARROW-12519:
-------------------------------------
I'll close it. I think this issue in particular turned out to be Python
holding onto the memory (something to do with exceptions or for loops). I'm
not aware of any real-world, obvious and egregious jemalloc misbehavior at the
moment.
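
For illustration only, here is a minimal sketch of one plausible way a for
loop can hold onto memory. It does not use Arrow at all: Blob and Tracker are
hypothetical stand-ins for a large table and a memory counter, and the
behaviour assumes CPython's reference counting. When a name is rebound inside
a loop, the new object is built before the old one is released, so two
objects are alive at the peak. When the work is moved into a function, the
local is released on return, before the next object is built.

```python
class Tracker:
    """Hypothetical counter of live and peak-live Blob instances."""

    def __init__(self):
        self.live = 0
        self.peak = 0


class Blob:
    """Hypothetical stand-in for a large table."""

    def __init__(self, tracker):
        self.tracker = tracker
        tracker.live += 1
        tracker.peak = max(tracker.peak, tracker.live)

    def __del__(self):
        # Under CPython this runs as soon as the last reference goes away.
        self.tracker.live -= 1


def loop_inline(tracker, n=10):
    # Mirrors the first version: `table` survives from one iteration to the
    # next, so the new Blob is constructed while the old one is still bound.
    for _ in range(n):
        table = Blob(tracker)
    return tracker.peak


def loop_with_function(tracker, n=10):
    # Mirrors the second version: the local inside bm() is released when
    # bm() returns, before the next Blob is constructed.
    def bm():
        table = Blob(tracker)

    for _ in range(n):
        bm()
    return tracker.peak


if __name__ == "__main__":
    print("inline loop peak live objects:", loop_inline(Tracker()))
    print("function loop peak live objects:", loop_with_function(Tracker()))
```

This only shows the object-lifetime effect. It does not by itself account for
the exact numbers reported in the issue, which also depend on the dataframe
copy and on how the allocator returns memory to the OS.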
> [C++] Create/document better characterization of jemalloc/mimalloc
> ------------------------------------------------------------------
>
> Key: ARROW-12519
> URL: https://issues.apache.org/jira/browse/ARROW-12519
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
> Attachments: csv-uncompressed-8core.png
>
>
> The following script reads in a large dataset 10 times in a loop. The
> dataset is from the Ursa benchmarks
> (https://github.com/ursacomputing/benchmarks). However, any sufficiently
> large dataset should work. The dataset is ~5-6 GB when deserialized into
> an Arrow table. The conversion to a dataframe is not zero-copy, so the
> loop requires about 8.6 GB.
> Running this loop for 10 iterations with mimalloc consumes 27 GB of RAM. It
> is pretty deterministic. Even putting a 1 second sleep in between each
> iteration yields the same result. On the other hand, if I put the read into
> its own method (second version below) then it uses only 14 GB.
> Our current rule of thumb seems to be "as long as the allocators stabilize to
> some number at some point then it is not a bug" so technically both 27 GB and
> 14 GB are valid.
> If we can't put any kind of bound whatsoever on the RAM that Arrow needs then
> it will eventually become a problem for adoption. I think we need to develop
> some sort of characterization around how much mimalloc/jemalloc should be
> allowed to over-allocate before we consider it a bug and require changing the
> code to avoid the situation (or documenting that certain operations are not
> valid).
>
> ----CODE----
>
> // First version (uses ~27GB)
> {code:java}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
> for _ in range(10):
>     # `table` and `df` stay bound between iterations
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> {code}
> // Second version (uses ~14GB)
> {code:java}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
> def bm():
>     # `table` and `df` are locals, released when bm() returns
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> for _ in range(10):
>     bm()
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)