[ 
https://issues.apache.org/jira/browse/IMPALA-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745174#comment-16745174
 ] 

ASF subversion and git services commented on IMPALA-6964:
---------------------------------------------------------

Commit 8da44ce16bb190dadab2ff3d22e5df726d1128e3 in impala's branch 
refs/heads/master from stakiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8da44ce ]

IMPALA-6964: Track stats about column and page sizes in Parquet reader

Adds the following new stats:

* ParquetCompressedPageSize - a summary (average, min, max) counter that
tracks the size of compressed pages read; if no compressed pages are
read, this counter is empty
* ParquetUncompressedPageSize - a summary counter that tracks the size
of uncompressed pages read; it is updated in two places: (1) when a
compressed page is decompressed, and (2) when a page that is not
compressed is read
* ParquetCompressedDataReadPerColumn - a summary counter that tracks the
amount of compressed data read per column for a scan node
* ParquetUncompressedDataReadPerColumn - a summary counter that tracks
the amount of uncompressed data read per column for a scan node

The PerColumn counters are calculated by aggregating the number of bytes
read for each column across all scan ranges processed by a scan node.
Each sample in the counter is the size of a single column.
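
The aggregation described above can be sketched roughly as follows. This
is an illustrative Python model only, not the Impala backend code; the
function name summarize_per_column and the dict-based input shape are
assumptions made for the example.

```python
from collections import defaultdict


def summarize_per_column(scan_ranges):
    """Model of a PerColumn summary counter (hypothetical helper).

    scan_ranges: iterable of dicts mapping column name -> bytes read
    for that column within one scan range.
    """
    # Aggregate the bytes read for each column across all scan ranges.
    totals = defaultdict(int)
    for bytes_by_column in scan_ranges:
        for column, nbytes in bytes_by_column.items():
            totals[column] += nbytes

    # Each sample in the summary is the total size of a single column.
    samples = list(totals.values())
    if not samples:
        return None  # no data read, so the counter stays empty
    return {
        "avg": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
        "num_samples": len(samples),
    }


# Two scan ranges that each read columns "a" and "b".
ranges = [{"a": 100, "b": 300}, {"a": 50, "b": 150}]
summary = summarize_per_column(ranges)
```

With two columns the summary has two samples regardless of how many scan
ranges were processed, which matches "Number of samples: 2" for the
PerColumn counters in the example profile.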

Here is an example of what the updated HDFS scan profile looks like:

- ParquetCompressedDataReadPerColumn: (Avg: 227.56 KB (233018) ;
Min: 225.14 KB (230540) ; Max: 229.98 KB (235496) ; Number of samples: 2)
- ParquetUncompressedDataReadPerColumn: (Avg: 227.96 KB (233426) ;
Min: 224.91 KB (230306) ; Max: 231.00 KB (236547) ; Number of samples: 2)
- ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ;
Max: 5.19 KB (5315) ; Number of samples: 102)
- ParquetUncompressedPageSize: (Avg: 4.47 KB (4576) ; Min: 3.86 KB (3950) ;
Max: 5.22 KB (5349) ; Number of samples: 102)
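
As a rough illustration of how these summary counters could be consumed,
the sketch below parses counter lines in the format shown above and
derives a compression ratio from the two PerColumn averages. The regular
expression and the helper name parse_summary_counters are assumptions
written for this example; they are not part of Impala or its test suite.

```python
import re

# Matches one summary counter as printed in the example profile. \s* is
# used between fields so that counters wrapped across lines still match.
SUMMARY_RE = re.compile(
    r"(?P<name>\w+):\s*\(Avg:\s*[\d.]+\s*\w+\s*\((?P<avg>\d+)\)\s*;"
    r"\s*Min:\s*[\d.]+\s*\w+\s*\((?P<min>\d+)\)\s*;"
    r"\s*Max:\s*[\d.]+\s*\w+\s*\((?P<max>\d+)\)\s*;"
    r"\s*Number of samples:\s*(?P<samples>\d+)\)"
)


def parse_summary_counters(profile_text):
    """Return {counter name: {avg, min, max, samples}} with byte values."""
    counters = {}
    for match in SUMMARY_RE.finditer(profile_text):
        counters[match.group("name")] = {
            key: int(match.group(key))
            for key in ("avg", "min", "max", "samples")
        }
    return counters


profile = """
- ParquetCompressedDataReadPerColumn: (Avg: 227.56 KB (233018) ;
Min: 225.14 KB (230540) ; Max: 229.98 KB (235496) ; Number of samples: 2)
- ParquetUncompressedDataReadPerColumn: (Avg: 227.96 KB (233426) ;
Min: 224.91 KB (230306) ; Max: 231.00 KB (236547) ; Number of samples: 2)
"""

counters = parse_summary_counters(profile)
compressed = counters["ParquetCompressedDataReadPerColumn"]
uncompressed = counters["ParquetUncompressedDataReadPerColumn"]

# Both counters have the same number of samples, so the ratio of the
# averages equals the ratio of the total bytes read.
compression_ratio = uncompressed["avg"] / compressed["avg"]
```

For the example profile the ratio comes out very close to 1, meaning the
data read in that run was barely compressed.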

Testing:
* Added new tests to test_scanners.py that do some basic validation of
the new counters above

Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817
Reviewed-on: http://gerrit.cloudera.org:8080/11575
Reviewed-by: Tim Armstrong <tarmstr...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Track stats about column and page sizes in Parquet reader
> ---------------------------------------------------------
>
>                 Key: IMPALA-6964
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6964
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Sahil Takiar
>            Priority: Major
>              Labels: observability, parquet, ramp-up
>
> It would be good to have stats for scanned parquet data about page sizes. We 
> currently can't tell much about the "shape" of the parquet pages from the 
> profile. Some questions that are interesting:
> * How big is each column? I.e. total compressed and decompressed size read.
> * How big are pages on average? Either compressed or decompressed size.
> * What is the compression ratio for pages? Could be inferred from the above 
> two.
> I think storing all the stats in the profile per-column would be too much 
> data, but we could probably infer most useful things from higher-level 
> aggregates.



