[ https://issues.apache.org/jira/browse/PARQUET-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288314#comment-16288314 ]

Wes McKinney commented on PARQUET-1084:
---------------------------------------

It seems this is related to the use of mmap. Perhaps we should turn off 
mmapping by default, though it's very weird that it would be reading the whole 
file. I don't know how this will impact performance in general.
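
One way to probe this from Python is to hand {{read_table}} an already-opened 
plain file instead of a path, so pyarrow cannot memory-map it. This is only a 
sketch, assuming {{pa.OSFile}} is available in the installed pyarrow and 
reusing the reporter's path and column; it is not a confirmed fix:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

# Open the file as an ordinary buffered OS file rather than letting the
# reader memory-map the path, so only the explicitly requested byte ranges
# should be read from disk.
f = pa.OSFile("data/B2HHH-inflated.parquet", "rb")
table = pq.read_table(f, columns=["h1_px"])  # one column
f.close()
print(table.num_rows)
{code}

If this touches only a small fraction of the file's pages while the path-based 
call touches all of them, that would support the mmap hypothesis.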

> Parquet-C++ doesn't selectively read columns with mmap'ed files
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1084
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1084
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.0.0, cpp-1.2.0
>            Reporter: Jim Pivarski
>              Labels: performance
>             Fix For: cpp-1.4.0
>
>
> I first saw this reported in a [review of file formats for 
> C++](https://indico.cern.ch/event/567550/contributions/2628878/attachments/1511966/2358123/hep-file-formats.pdf),
>  which showed that an attempt to read two columns from a Parquet file in C++ 
> resulted in the whole file (all 26 columns) being read (18th page of the 
> PDF, "15 / 25" in the bottom-right corner). That test used Parquet-C++ 
> version 1.2.0.
> To check this, I pip-installed pyarrow (version 0.6.0), which comes with 
> Parquet-C++ version 1.0.0. I used [vmtouch](https://hoytech.com/vmtouch/) to 
> identify the fraction of pages touched, and double-checked by measuring the 
> time-to-load. Because the disk is slow, the load time makes it obvious 
> whether one column or all columns are being read.
> I'm using the same files as the presenter of that talk: 
> [B2HHH.parquet-inflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-inflated)
>  and 
> [B2HHH.parquet-deflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-deflated).
> They have 20 double-precision columns and 6 int32 columns with no nesting: 
> 500 rows per group * 17113 row groups ≈ 8,556,118 rows (the last row group 
> is partial), about 1.5 GB for the inflated (uncompressed) file. Each column 
> chunk within a row group should therefore be about 4000 bytes (double) or 
> 2000 bytes (int32), so reading one column should touch one or two 4k disk 
> pages per row group out of roughly 23 pages of column data per row group 
> (92,000 bytes), depending on alignment. Granularity should not be a problem, 
> as it would be if the row groups were too small.
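> Spelling that arithmetic out (a sketch assuming 4096-byte pages and the 
> 500-rows-per-group layout above; real column chunks carry some page-header 
> and metadata overhead on top of this):
> {code}
> ROWS_PER_GROUP = 500
> N_DOUBLE, N_INT32 = 20, 6
> PAGE = 4096                                  # assumed disk page size
> 
> col_double = ROWS_PER_GROUP * 8              # 4000 bytes per double chunk
> col_int32 = ROWS_PER_GROUP * 4               # 2000 bytes per int32 chunk
> row_group = N_DOUBLE * col_double + N_INT32 * col_int32  # 92000 bytes
> 
> print(row_group, row_group / PAGE)           # 92000 bytes ~= 22.5 pages
> {code}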
> *Procedure:*
> # I evicted the uncompressed file from VM cache to force reads to come from 
> disk.
> # I imported {{pyarrow.parquet}} in Python and called 
> {{read_table("data/B2HHH-inflated.parquet", ["h1_px"])}} (one column; see 
> the sketch after this list).
> # I checked to see how much of the file has been loaded into VM cache.
> # I also checked the time-to-load of one column from cold cache versus all 
> columns from cold cache.
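> A minimal sketch of that procedure (assumes {{vmtouch}} is installed and on 
> the PATH, and reuses the path and column from above):
> {code}
> import subprocess
> import time
> import pyarrow.parquet as pq
> 
> path = "data/B2HHH-inflated.parquet"
> 
> # Step 1: evict the file's pages from the VM cache.
> subprocess.check_call(["vmtouch", "-e", path])
> 
> # Step 2: read a single column from cold cache and time it.
> t0 = time.time()
> table = pq.read_table(path, columns=["h1_px"])
> print("one column, cold cache: %.1f s" % (time.time() - t0))
> 
> # Step 3: report what fraction of the file is now resident in the VM cache;
> # a selective read should show a small fraction, not 100%.
> subprocess.check_call(["vmtouch", path])
> {code}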
> The result is that the entire file gets loaded into VM cache and the file 
> takes 14.6 seconds to read regardless of whether I read one column or the 
> whole file. (From a warm cache it takes 4.7 seconds, so we're clearly seeing 
> the effect of disk speed.) Both measurements agree that the file is _not_ 
> being selectively read, as I think it should be.
> Is there a setting that the presenter of the talk (using Parquet-C++ version 
> 1.2.0 in C++) and I (using pyarrow with Parquet-C++ 1.0.0 in Python) are both 
> missing? Is this a future feature? I would consider it to be a performance 
> bug, since a major reason for having a columnar data format is to read 
> columns selectively.


