[ https://issues.apache.org/jira/browse/ARROW-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441823#comment-17441823 ]
Antoine Pitrou edited comment on ARROW-12519 at 11/10/21, 4:20 PM:
-------------------------------------------------------------------
[~westonpace] Should we do something with this JIRA? (sorry, edited: "JIRA" not
"PR")
was (Author: pitrou):
[~westonpace] Should we do something with this PR?
> [C++] Create/document better characterization of jemalloc/mimalloc
> ------------------------------------------------------------------
>
> Key: ARROW-12519
> URL: https://issues.apache.org/jira/browse/ARROW-12519
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
> Attachments: csv-uncompressed-8core.png
>
>
> The following script reads a large dataset 10 times in a loop. The
> dataset comes from the Ursa benchmarks
> (https://github.com/ursacomputing/benchmarks). However, any sufficiently
> large dataset should do. The dataset is ~5-6 GB when deserialized into
> an Arrow table. The conversion to a dataframe is not zero-copy and so the
> loop requires about 8.6GB.
> Running this loop 10 times with mimalloc consumes 27GB of RAM. The result is
> pretty deterministic. Even putting a 1 second sleep between runs yields the
> same result. On the other hand, if I put the read into its own function
> (second version below), then it uses only 14 GB.
> Our current rule of thumb seems to be "as long as the allocators stabilize to
> some number at some point then it is not a bug" so technically both 27GB and
> 14GB are valid.
> If we can't put any kind of bound whatsoever on the RAM that Arrow needs then
> it will eventually become a problem for adoption. I think we need to develop
> some sort of characterization around how much mimalloc/jemalloc should be
> allowed to over-allocate before we consider it a bug and require changing the
> code to avoid the situation (or documenting that certain operations are not
> valid).
>
> ----CODE----
>
> // First version (uses ~27GB)
> {code:java}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
> for _ in range(10):
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> {code}
> // Second version (uses ~14GB)
> {code:java}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
> def bm():
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> for _ in range(10):
>     bm()
> {code}
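
The description asks for a bound on how far the allocator may over-allocate. One way to phrase such a bound is as the ratio of resident memory to the bytes the workload actually needs. The sketch below is only an illustration: `overallocation_factor`, `exceeds_bound`, and the 2.0 threshold are hypothetical names and values, not anything Arrow defines. The inputs are the approximate figures quoted in the description (27GB, 14GB, and the ~8.6GB the loop requires).

```python
GB = 10**9  # assumption: decimal gigabytes; the description does not say GB vs GiB


def overallocation_factor(resident_bytes: float, needed_bytes: float) -> float:
    """Return resident memory divided by the bytes the workload needs."""
    if needed_bytes <= 0:
        raise ValueError("needed_bytes must be positive")
    return resident_bytes / needed_bytes


def exceeds_bound(resident_bytes: float, needed_bytes: float, max_factor: float) -> bool:
    """Return True if resident memory is more than max_factor times the needed bytes."""
    return overallocation_factor(resident_bytes, needed_bytes) > max_factor


needed = 8.6 * GB  # approximate requirement of the loop, from the description
for label, resident in [("first version", 27 * GB), ("second version", 14 * GB)]:
    factor = overallocation_factor(resident, needed)
    # 2.0 is an arbitrary example threshold, not a proposed policy.
    over = exceeds_bound(resident, needed, 2.0)
    print(f"{label}: {factor:.2f}x, exceeds 2.0x: {over}")
```

In a real measurement, `resident_bytes` could come from `psutil.Process(os.getpid()).memory_info().rss` and the Arrow-side count from `pa.total_allocated_bytes()`, both of which the scripts in the description already print.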
--
This message was sent by Atlassian Jira
(v8.20.1#820001)