Weston Pace created ARROW-16451:
-----------------------------------

             Summary: [C++] ParquetFileFragment caches parquet file metadata 
and there is no way to disable this
                 Key: ARROW-16451
                 URL: https://issues.apache.org/jira/browse/ARROW-16451
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


While investigating ARROW-15081 we saw surprisingly high memory use, even 
though all of the results were being accumulated into a single 64 byte counter 
(e.g. {{SELECT COUNT(*) FROM table}}).

The memory turned out to be the parquet metadata, which gets attached to each 
parquet file fragment and stays cached there.  There is currently no way to 
prevent this.  In this case there were 1100 files and each file had ~10MB of 
metadata, so the cached metadata alone came to roughly 11GB of RAM.

We should add an option to disable this caching, and caching should probably 
be off by default.  Caching the metadata is useful if you are going to scan 
the same dataset again and again, but otherwise it is just wasted RAM.
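
To make the trade-off concrete, here is a minimal, self-contained sketch of 
what such an option might look like.  It does not use the Arrow API: 
SketchFragment, FileMetadata, ScanSummary, CountStar and the cache_metadata 
flag are all hypothetical names invented for illustration, and the metadata is 
modelled only by its size rather than actually allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a parsed parquet footer. Only the size and the
// row count matter for this sketch; no real buffer is allocated.
struct FileMetadata {
  std::uint64_t size_bytes;
  std::int64_t num_rows;
};

// Hypothetical fragment, NOT Arrow's ParquetFileFragment.
class SketchFragment {
 public:
  SketchFragment(std::uint64_t metadata_bytes, std::int64_t num_rows,
                 bool cache_metadata)
      : metadata_bytes_(metadata_bytes),
        num_rows_(num_rows),
        cache_metadata_(cache_metadata) {}

  // Answers COUNT(*) for this file from its metadata. The metadata is only
  // retained on the fragment when caching is enabled.
  std::int64_t CountRows() {
    std::shared_ptr<const FileMetadata> md = cached_;
    if (!md) {
      md = LoadMetadata();
      if (cache_metadata_) cached_ = md;
    }
    assert(md != nullptr);
    return md->num_rows;  // if not cached, md is released on return
  }

  // Bytes of metadata still held by this fragment after a scan.
  std::uint64_t RetainedMetadataBytes() const {
    return cached_ ? cached_->size_bytes : 0;
  }

 private:
  // Pretends to read and parse the file footer.
  std::shared_ptr<const FileMetadata> LoadMetadata() const {
    return std::make_shared<FileMetadata>(
        FileMetadata{metadata_bytes_, num_rows_});
  }

  std::uint64_t metadata_bytes_;
  std::int64_t num_rows_;
  bool cache_metadata_;
  std::shared_ptr<const FileMetadata> cached_;
};

struct ScanSummary {
  std::int64_t total_rows;
  std::uint64_t retained_metadata_bytes;
};

// Runs a COUNT(*)-style scan over num_files identical fragments and reports
// how much metadata the fragments still hold afterwards.
ScanSummary CountStar(std::size_t num_files, std::uint64_t metadata_bytes,
                      std::int64_t rows_per_file, bool cache_metadata) {
  std::vector<SketchFragment> fragments(
      num_files, SketchFragment(metadata_bytes, rows_per_file, cache_metadata));
  ScanSummary summary{0, 0};
  for (auto& fragment : fragments) summary.total_rows += fragment.CountRows();
  for (const auto& fragment : fragments) {
    summary.retained_metadata_bytes += fragment.RetainedMetadataBytes();
  }
  return summary;
}
```

With the numbers from this report (1100 files, 10MB of metadata each, taking 
1MB as 10^6 bytes), CountStar(1100, 10000000, rows, true) leaves 11,000,000,000 
bytes retained on the fragments, while passing false leaves none.  The cost of 
turning caching off is that every repeated scan has to load the metadata again.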



--
This message was sent by Atlassian Jira
(v8.20.7#820007)