[ https://issues.apache.org/jira/browse/IMPALA-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745174#comment-16745174 ]
ASF subversion and git services commented on IMPALA-6964:
---------------------------------------------------------

Commit 8da44ce16bb190dadab2ff3d22e5df726d1128e3 in impala's branch refs/heads/master from stakiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8da44ce ]

IMPALA-6964: Track stats about column and page sizes in Parquet reader

Adds the following new stats:
* ParquetCompressedPageSize - a summary (average, min, max) counter that
  tracks the size of compressed pages read, if no compressed pages are read
  then this counter is empty
* ParquetUncompressedPageSize - a summary counter that tracks the size of
  uncompressed pages read, it is updated in two places: (1) when a compressed
  page is de-compressed, and (2) when a page that is not compressed is read
* ParquetCompressedDataReadPerColumn - a summary counter that tracks the
  amount of compressed data read per column for a scan node
* ParquetUncompressedDataReadPerColumn - a summary counter that tracks the
  amount of uncompressed data read per column for a scan node

The PerColumn counters are calculated by aggregating the number of bytes
read for each column across all scan ranges processed by a scan node. Each
sample in the counter is the size of a single column.
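
The PerColumn aggregation described above can be sketched roughly as follows. This is an illustrative Python model, not the Impala backend code: `SummaryCounter`, `per_column_counter`, the column names, and the split of bytes across scan ranges are all hypothetical. Only the per-column totals (230540 and 235496 bytes) are taken from the example profile in this message.

```python
from collections import defaultdict


class SummaryCounter:
    """Hypothetical stand-in for a summary (avg/min/max) profile counter."""

    def __init__(self):
        self.samples = []

    def update(self, value):
        self.samples.append(value)

    def summary(self):
        # With no samples the counter is empty, as the commit message
        # describes for ParquetCompressedPageSize.
        if not self.samples:
            return None
        return {
            "avg": sum(self.samples) / len(self.samples),
            "min": min(self.samples),
            "max": max(self.samples),
            "num_samples": len(self.samples),
        }


def per_column_counter(scan_ranges):
    """Sum bytes per column across scan ranges, then add one sample per column.

    scan_ranges: iterable of {column_name: bytes_read} dicts, one per range.
    """
    totals = defaultdict(int)
    for bytes_by_column in scan_ranges:
        for column, num_bytes in bytes_by_column.items():
            totals[column] += num_bytes

    counter = SummaryCounter()
    for total in totals.values():
        counter.update(total)
    return counter


# Assumed split of bytes over two scan ranges; only the per-column sums
# correspond to the compressed PerColumn counter in the example profile.
ranges = [
    {"col_a": 100000, "col_b": 120000},
    {"col_a": 130540, "col_b": 115496},
]
result = per_column_counter(ranges).summary()
print(result)
```

Under these assumptions the counter holds two samples, one per column, regardless of how many scan ranges contributed bytes.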

Here is an example of what the updated HDFS scan profile looks like:

  - ParquetCompressedDataReadPerColumn: (Avg: 227.56 KB (233018) ; Min: 225.14 KB (230540) ; Max: 229.98 KB (235496) ; Number of samples: 2)
  - ParquetUncompressedDataReadPerColumn: (Avg: 227.96 KB (233426) ; Min: 224.91 KB (230306) ; Max: 231.00 KB (236547) ; Number of samples: 2)
  - ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ; Max: 5.19 KB (5315) ; Number of samples: 102)
  - ParquetDecompressedPageSize: (Avg: 4.47 KB (4576) ; Min: 3.86 KB (3950) ; Max: 5.22 KB (5349) ; Number of samples: 102)

Testing:
* Added new tests to test_scanners.py that do some basic validation of the
  new counters above

Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817
Reviewed-on: http://gerrit.cloudera.org:8080/11575
Reviewed-by: Tim Armstrong <tarmstr...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>

> Track stats about column and page sizes in Parquet reader
> ---------------------------------------------------------
>
>                 Key: IMPALA-6964
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6964
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Sahil Takiar
>            Priority: Major
>              Labels: observability, parquet, ramp-up
>
> It would be good to have stats for scanned parquet data about page sizes. We
> currently can't tell much about the "shape" of the parquet pages from the
> profile. Some questions that are interesting:
> * How big is each column? I.e. total compressed and decompressed size read.
> * How big are pages on average? Either compressed or decompressed size
> * What is the compression ratio for pages? Could be inferred from the above
> two.
> I think storing all the stats in the profile per-column would be too much
> data, but we could probably infer most useful things from higher-level
> aggregates.
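
As a rough illustration of how the page compression ratio might be inferred from the aggregates, the sketch below parses the two page-size lines from the example profile and divides the average sizes. The regular expression and the helper name `parse_counters` are assumptions made for this example, not part of Impala. The profile text is copied from the example in this message, which labels the second counter ParquetDecompressedPageSize while the description calls it ParquetUncompressedPageSize, so the sketch accepts either name.

```python
import re

# Two lines copied from the example profile in this message.
PROFILE = """\
- ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ; Max: 5.19 KB (5315) ; Number of samples: 102)
- ParquetDecompressedPageSize: (Avg: 4.47 KB (4576) ; Min: 3.86 KB (3950) ; Max: 5.22 KB (5349) ; Number of samples: 102)
"""

# Assumed line layout, inferred only from the example above; the raw byte
# count is the number in the inner parentheses.
COUNTER_RE = re.compile(
    r"-\s*(?P<name>\w+):\s*\(Avg:[^()]*\((?P<avg>\d+)\)\s*;"
    r"\s*Min:[^()]*\((?P<min>\d+)\)\s*;"
    r"\s*Max:[^()]*\((?P<max>\d+)\)\s*;"
    r"\s*Number of samples:\s*(?P<n>\d+)\)"
)


def parse_counters(text):
    """Return {counter_name: {"avg", "min", "max", "n"}} with integer values."""
    counters = {}
    for match in COUNTER_RE.finditer(text):
        fields = match.groupdict()
        name = fields.pop("name")
        counters[name] = {key: int(value) for key, value in fields.items()}
    return counters


counters = parse_counters(PROFILE)
compressed = counters["ParquetCompressedPageSize"]
# Accept either spelling of the uncompressed page-size counter.
uncompressed = (counters.get("ParquetUncompressedPageSize")
                or counters["ParquetDecompressedPageSize"])

# Ratio of average uncompressed page size to average compressed page size.
ratio = uncompressed["avg"] / compressed["avg"]
print(f"approximate page compression ratio: {ratio:.4f}")
```

The ratio of averages is only an estimate. The uncompressed counter is also updated for pages that were never compressed, so the two counters may hold different numbers of samples in other profiles.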