Weston Pace created ARROW-16451:
-----------------------------------

             Summary: [C++] ParquetFileFragment caches parquet file metadata and there is no way to disable this
                 Key: ARROW-16451
                 URL: https://issues.apache.org/jira/browse/ARROW-16451
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace

While looking at ARROW-15081 we saw a surprising amount of memory in use even though all of the results were being accumulated into a single 64 byte counter (e.g. {{SELECT COUNT(*) FROM table}}). It turns out this memory was the parquet file metadata, which gets attached to the parquet file fragment and stays there after the file has been scanned. There is currently no way to prevent this and, in this case, it was using quite a bit of RAM: there were 1100 files and each file had ~10MB of metadata, which adds up to roughly 11GB.

We should have an option for disabling this. Also, the caching should probably be off by default. It can be useful to cache the metadata if you are going to scan the same dataset again and again, but otherwise it is just wasted RAM.
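
To make the request concrete, here is a minimal, self-contained sketch of what an opt-in cache could look like. It does not use the Arrow API: {{FileMetadata}}, {{FragmentOptions}}, {{cache_metadata}}, {{FileFragment}} and {{ScanAndMeasure}} are all hypothetical names invented for illustration, and the metadata "loader" just allocates a buffer of the requested size instead of reading a parquet footer.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the parsed footer metadata of one parquet file.
struct FileMetadata {
  std::vector<char> bytes;
};

// Hypothetical option struct. The field name is an assumption, not an Arrow
// option. It defaults to false, matching the "off by default" suggestion.
struct FragmentOptions {
  bool cache_metadata = false;
};

// Hypothetical fragment that loads metadata on demand and keeps it alive
// only when caching was requested.
class FileFragment {
 public:
  using Loader = std::function<std::shared_ptr<FileMetadata>()>;

  FileFragment(Loader loader, FragmentOptions options)
      : loader_(std::move(loader)), options_(options) {
    assert(loader_ && "a metadata loader is required");
  }

  // Returns the metadata for one scan. The fragment retains a reference
  // only if cache_metadata is set; otherwise the caller holds the only one.
  std::shared_ptr<FileMetadata> GetMetadata() {
    if (cached_) return cached_;
    std::shared_ptr<FileMetadata> metadata = loader_();
    if (options_.cache_metadata) cached_ = metadata;
    return metadata;
  }

  // Bytes of metadata this fragment is still holding on to.
  std::size_t RetainedBytes() const {
    return cached_ ? cached_->bytes.size() : 0;
  }

 private:
  Loader loader_;
  FragmentOptions options_;
  std::shared_ptr<FileMetadata> cached_;
};

// Simulates a COUNT(*)-style scan: every fragment's metadata is read once
// and the scan's own reference is dropped immediately. Returns the number of
// metadata bytes the fragments still retain afterwards.
std::size_t ScanAndMeasure(std::size_t num_files, std::size_t metadata_bytes,
                           bool cache_metadata) {
  FragmentOptions options;
  options.cache_metadata = cache_metadata;

  std::vector<FileFragment> fragments;
  fragments.reserve(num_files);
  for (std::size_t i = 0; i < num_files; ++i) {
    fragments.emplace_back(
        [metadata_bytes] {
          auto metadata = std::make_shared<FileMetadata>();
          metadata->bytes.resize(metadata_bytes);
          return metadata;
        },
        options);
  }

  std::size_t retained = 0;
  for (auto& fragment : fragments) {
    fragment.GetMetadata();  // temporary reference is released right here
    retained += fragment.RetainedBytes();
  }
  return retained;
}
```

With caching enabled, the retained size grows linearly with the number of files, which is the behaviour described above (1100 files at ~10MB each). With caching disabled, nothing is retained once the scan has moved past a file, at the cost of re-reading the metadata on a later scan.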