[ https://issues.apache.org/jira/browse/PARQUET-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288314#comment-16288314 ]
Wes McKinney commented on PARQUET-1084:
---------------------------------------

It seems this is related to the use of mmap. Perhaps we should turn off mmapping by default, though it is very odd that it would read the whole file. I don't know how this would affect performance in general.

> Parquet-C++ doesn't selectively read columns with mmap'ed files
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1084
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1084
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.0.0, cpp-1.2.0
>            Reporter: Jim Pivarski
>              Labels: performance
>             Fix For: cpp-1.4.0
>
>
> I first saw this reported in a [review of file formats for
> C++](https://indico.cern.ch/event/567550/contributions/2628878/attachments/1511966/2358123/hep-file-formats.pdf),
> which showed that an attempt to read two columns from a Parquet file in C++
> resulted in the whole file (all 26 columns) being read (18th page of the
> PDF, "15 / 25" in the bottom-right corner). That test used Parquet-C++
> version 1.2.0.
>
> To check this, I pip-installed pyarrow (version 0.6.0), which comes with
> Parquet-C++ version 1.0.0. I used [vmtouch](https://hoytech.com/vmtouch/) to
> identify the fraction of pages touched, and double-checked by measuring the
> time-to-load. Because the disk is slow, it is obvious whether one column or
> all columns are being read.
>
> I'm using the same files as the presenter of that talk:
> [B2HHH.parquet-inflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-inflated)
> and
> [B2HHH.parquet-deflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-deflated).
> They have 20 double-precision columns and 6 int32 columns with no nesting,
> 500 rows per group * 17113 row groups = 8556118 rows = 1.5 GB for the
> inflated (uncompressed) file. Each column within a row group should be 4000
> or 2000 bytes, so reading one column should touch one or two 4k disk pages
> per row group out of 769 disk pages per row group, depending on alignment.
> Granularity should not be a problem, as it would be if the row groups were
> too small.
>
> *Procedure:*
> # I evicted the uncompressed file from VM cache to force reads to come from
> disk.
> # I imported {{pyarrow.parquet}} in Python and called
> {{read_table("data/B2HHH-inflated.parquet", ["h1_px"])}} (one column).
> # I checked how much of the file had been loaded into VM cache.
> # I also compared the time-to-load of one column from cold cache with that
> of all columns from cold cache.
>
> The result is that the entire file gets loaded into VM cache and the file
> takes 14.6 seconds to read regardless of whether I read one column or the
> whole file. (From warm cache it takes 4.7 seconds, so we're clearly seeing
> the effect of disk speed.) Both methods agree that the file is _not_ being
> selectively read, as I think it should be.
>
> Is there a setting that the presenter of the talk (using Parquet-C++ version
> 1.2.0 in C++) and I (using pyarrow with Parquet-C++ 1.0.0 in Python) are both
> missing? Is this a future feature? I would consider it to be a performance
> bug, since a major reason for having a columnar data format is to read
> columns selectively.
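
The "one or two 4k disk pages" estimate in the report can be checked with a little arithmetic. The following is only a rough sketch, not anything from Parquet-C++ or pyarrow: the helper name `pages_touched` and the example offsets are hypothetical, the 4096-byte page size is assumed, and Parquet page headers and encoding overhead are ignored.

```python
PAGE_SIZE = 4096  # assumed 4k page size, as in the report


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Count the pages overlapped by the byte range [offset, offset + length).

    Hypothetical helper for illustration; it is not part of any Parquet API.
    """
    if length <= 0:
        return 0
    first_page = offset // page_size
    last_page = (offset + length - 1) // page_size
    return last_page - first_page + 1


ROWS_PER_GROUP = 500                      # from the report
double_chunk_bytes = ROWS_PER_GROUP * 8   # one double column in one row group
int32_chunk_bytes = ROWS_PER_GROUP * 4    # one int32 column in one row group

# A column chunk that starts on a page boundary fits in a single page.
aligned = pages_touched(0, double_chunk_bytes)

# The same chunk starting at an arbitrary (made-up) offset straddles two pages.
straddling = pages_touched(4000, double_chunk_bytes)

print(double_chunk_bytes, int32_chunk_bytes, aligned, straddling)
```

Under these assumptions a selective read of one column should touch only a page or two per row group, which is why the whole file ending up in VM cache points to a non-selective read.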