[ 
https://issues.apache.org/jira/browse/PARQUET-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184375#comment-16184375
 ] 

Jim Pivarski commented on PARQUET-1084:
---------------------------------------

Yes, it's the memory map.

To avoid confusion, I removed the OS info above and am including OS info for a 
machine where I've been able to test it (has the file and the library). In each 
test, I fully evict `B2HHH.parquet-inflated` before attempting to read it.

In the first experiment, I open the file with system calls:

{{>>> import pyarrow.parquet
>>> f = pyarrow.OSFile("B2HHH-inflated.parquet", "rb")
>>> pyarrow.parquet.read_table(f, ["h1_px"])
pyarrow.Table
h1_px: double not null}}

and that reads in 11.3% of the file.

In the second, I open the file as a memory map:

{{>>> import pyarrow.parquet
>>> f = pyarrow.memory_map("B2HHH-inflated.parquet", "rb")
>>> pyarrow.parquet.read_table(f, ["h1_px"])
pyarrow.Table
h1_px: double not null}}

and this reads 100% of the file.
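
As a rough sanity check on these two figures, the column layout given in the 
issue description (20 double and 6 int32 columns, 500 rows per row group) 
implies an expected fraction for reading a single double column. The sketch 
below is only arithmetic over those numbers; it assumes 4 KiB pages and ignores 
Parquet page headers, footer metadata, and any read-ahead, so it is an estimate 
rather than a prediction.

```python
# Layout figures quoted in the issue description for the inflated file.
ROWS_PER_GROUP = 500
N_DOUBLE_COLUMNS = 20   # 8 bytes per value
N_INT32_COLUMNS = 6     # 4 bytes per value
PAGE_SIZE = 4096        # assumption: 4 KiB disk/VM pages

double_chunk = ROWS_PER_GROUP * 8   # bytes of one double column per row group
int32_chunk = ROWS_PER_GROUP * 4    # bytes of one int32 column per row group
row_group_bytes = (N_DOUBLE_COLUMNS * double_chunk
                   + N_INT32_COLUMNS * int32_chunk)

# Ideal case: only the bytes of one double column are read.
ideal_fraction = double_chunk / row_group_bytes

# Page-granular case: a 4000-byte chunk may straddle two 4 KiB pages.
page_fraction = 2 * PAGE_SIZE / row_group_bytes

print(f"row group payload: {row_group_bytes} bytes")
print(f"ideal fraction:    {ideal_fraction:.1%}")
print(f"two-page fraction: {page_fraction:.1%}")
```

Under these assumptions the row group payload is 92000 bytes, the ideal 
fraction is about 4.3%, and the two-page fraction is about 8.9%. That is the 
same order as the 11.3% seen with `OSFile` and far from the 100% seen with 
`memory_map`.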

With a different file format, I also observed memory mapping loading too many 
pages, but in that case it was about 2% of the file instead of the expected 1%. 
The memory-mapping algorithm must have some look-ahead whose heuristics fail 
for files with a certain structure.
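
One way to probe the look-ahead hypothesis would be to map a file by hand and 
hint random access to the kernel. The following is a minimal sketch using 
Python's standard-library `mmap`, not pyarrow's `memory_map`; the helper 
`read_slices_mmap` and the synthetic 16-page file are hypothetical, and whether 
the hint changes the page count for the Parquet file above is untested.

```python
import mmap
import os
import tempfile


def read_slices_mmap(path, offsets, length):
    """Read `length` bytes at each offset through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # MADV_RANDOM asks the kernel not to read ahead. It exists only
            # on some platforms and needs Python >= 3.8, hence the guard.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_RANDOM"):
                m.madvise(mmap.MADV_RANDOM)
            return [m[o:o + length] for o in offsets]


# Synthetic stand-in for a columnar file: 16 pages, each filled with its index.
PAGE = 4096
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(16):
        tmp.write(bytes([i]) * PAGE)
    demo_path = tmp.name

# Touch only the first 8 bytes of every fourth page.
chunks = read_slices_mmap(demo_path, [i * PAGE for i in range(0, 16, 4)], 8)
os.remove(demo_path)
print([c[0] for c in chunks])
```

Running a page-residency tool such as vmtouch between the mapping step and the 
cleanup, with and without the `madvise` call, would show whether read-ahead 
accounts for the extra pages.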

Here's the OS version and distribution where I did the above test:

{{% uname -a
Linux jimpivarskiroot-2 3.10.0-514.21.2.el7.x86_64 #1 SMP Mon Jun 19 12:10:08 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux
% lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: Scientific
Description:    Scientific Linux release 7.3 (Nitrogen)
Release:        7.3
Codename:       Nitrogen}}


> Parquet-C++ doesn't selectively read columns
> --------------------------------------------
>
>                 Key: PARQUET-1084
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1084
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.0.0, cpp-1.2.0
>            Reporter: Jim Pivarski
>              Labels: performance
>             Fix For: cpp-1.3.0
>
>
> I first saw this reported in a [review of file formats for 
> C++](https://indico.cern.ch/event/567550/contributions/2628878/attachments/1511966/2358123/hep-file-formats.pdf),
>  which showed that an attempt to read two columns from a Parquet file in C++ 
> resulted in the whole file— 26 columns— being read (18th page of the PDF, "15 
> / 25" in the bottom-right corner). That test used Parquet-C++ version 1.2.0.
> To check this, I pip-installed pyarrow (version 0.6.0), which comes with 
> Parquet-C++ version 1.0.0. I used [vmtouch](https://hoytech.com/vmtouch/) to 
> identify the fraction of pages touched, and double-checked by measuring the 
> time-to-load. The fact that it's a slow disk makes it obvious whether it's 
> reading one column or all columns.
> I'm using the same files as the presenter of that talk: 
> [B2HHH.parquet-inflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-inflated)
>  and 
> [B2HHH.parquet-deflated](https://cernbox.cern.ch/index.php/s/ub43DwvQIFwxfxs/download?path=%2F&files=B2HHH.parquet-deflated).
>  They have 20 double-precision columns and 6 int32 columns with no nesting, 
> 500 rows per group * 17113 row groups = 8556118 rows = 1.5 GB for the 
> inflated (uncompressed) file. Each column within a row group should be 4000 
> or 2000 bytes, so reading one column should be one or two 4k disk pages per 
> row group out of 769 disk pages per row group, depending on alignment— 
> granularity should not be a problem, as it would be if the row groups were 
> too small.
> *Procedure:*
> # I evicted the uncompressed file from VM cache to force reads to come from 
> disk.
> # I imported {{pyarrow.parquet}} in Python and called 
> {{read_table("data/B2HHH-inflated.parquet", ["h1_px"])}} (one column).
> # I checked to see how much of the file has been loaded into VM cache.
> # I also checked the time-to-load of one column from cold cache versus all 
> columns from cold cache.
> The result is that the entire file gets loaded into VM cache and the file 
> takes 14.6 seconds to read regardless of whether I read one column or the 
> whole file. (Reading from warm cache takes 4.7 seconds, so we're clearly 
> seeing the effect of disk speed.) Both methods agree that the file is _not_ 
> being selectively read, as I think it should be.
> Is there a setting that the presenter of the talk (using Parquet-C++ version 
> 1.2.0 in C++) and I (using pyarrow with Parquet-C++ 1.0.0 in Python) are both 
> missing? Is this a future feature? I would consider it to be a performance 
> bug, since a major reason for having a columnar data format is to read 
> columns selectively.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
