Hi, I was going through the code in Iceberg ParquetReader. Could anybody confirm or correct my statements below?
Right now, Iceberg can filter out row groups in Parquet. Iceberg fetches row group stats from the footer and applies ParquetMetricsRowGroupFilter to that information.

In addition, the footer contains metadata for each column chunk, including its offset. ParquetDictionaryRowGroupFilter uses that column chunk metadata to read the optional dictionary page of each column chunk. If a dictionary page is present, it is always at the beginning of its column chunk. Before Iceberg filters out a row group based on dictionaries, ParquetDictionaryRowGroupFilter checks that all pages within the column chunk are dictionary encoded.

I also have a question about skipping individual pages using page stats. To the best of my knowledge, this info was originally stored in page headers, which made page skipping less efficient than it could be because it required reading all page headers, which are spread throughout the file. I remember some efforts in the Parquet community to add page-level statistics to the footer.

Now let's assume we have page-level stats in the footer, or some other efficient way to collect that info, and a query that covers two columns. Using a predicate on the first column, we see that page 3 doesn't contain any relevant values, so we can skip that entire page for that column. However, we cannot simply skip page 3 for the second column, because the number of values within a page is not fixed and might vary between column chunks. Basically, there is no one-to-one mapping between pages of different columns.

My question is whether relatively efficient page skipping is possible in Parquet at this point.

Thanks,
Anton
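
P.S. To make the two-column example concrete, here is a minimal Python sketch of the mapping problem. It assumes that the first row index of every page within the row group is known for each column; that is an assumption on my side, not something I have seen in the current reader. All function names and page layouts below are made up for illustration. Pages skipped in the first column are translated into surviving row ranges, and a page of the second column can only be skipped if it overlaps none of them.

```python
from typing import List, Set, Tuple

# Half-open row range [start, end) within one row group.
RowRange = Tuple[int, int]


def page_row_ranges(first_rows: List[int], num_rows: int) -> List[RowRange]:
    """Turn per-page first row indexes (hypothetical input) into row ranges."""
    ends = first_rows[1:] + [num_rows]
    return list(zip(first_rows, ends))


def surviving_row_ranges(pages: List[RowRange], skipped: Set[int]) -> List[RowRange]:
    """Row ranges that remain after dropping the pages ruled out by the predicate."""
    return [rng for i, rng in enumerate(pages) if i not in skipped]


def pages_to_read(pages: List[RowRange], keep: List[RowRange]) -> List[int]:
    """Indexes of pages that overlap at least one surviving row range."""
    return [
        i
        for i, (p_start, p_end) in enumerate(pages)
        if any(p_start < k_end and k_start < p_end for k_start, k_end in keep)
    ]


if __name__ == "__main__":
    # First column: four pages of 100 rows; the predicate rules out the third page.
    col_a = page_row_ranges([0, 100, 200, 300], num_rows=400)
    keep = surviving_row_ranges(col_a, skipped={2})

    # Second column with different page boundaries: nothing can be skipped.
    col_b = page_row_ranges([0, 150, 250], num_rows=400)
    print(pages_to_read(col_b, keep))  # [0, 1, 2]

    # Second column whose boundaries happen to line up: the middle page is skipped.
    col_b_aligned = page_row_ranges([0, 200, 300], num_rows=400)
    print(pages_to_read(col_b_aligned, keep))  # [0, 2]
```

In the first layout every page of the second column still has to be read even though rows 200 to 299 are irrelevant, which is exactly the situation I am asking about.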
