Hi,

I was going through the code in Iceberg ParquetReader. Could anybody confirm or 
correct my statements below?

Right now, Iceberg can filter out row groups in Parquet. Iceberg fetches row 
group stats from the footer and applies ParquetMetricsRowGroupFilter on that 
information. In addition, the footer contains metadata per column chunk 
including its offset. ParquetDictionaryRowGroupFilter uses that column chunk 
metadata to read an optional dictionary page for each column chunk. If a 
dictionary page is present, it will always be at the beginning of its column 
chunk. Before filtering out a row group based on dictionaries, 
ParquetDictionaryRowGroupFilter checks that all pages within the column chunk 
are dictionary encoded.
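
To make my understanding of the two row group checks concrete, here is a 
minimal sketch for an equality predicate. It is not Iceberg's actual code: 
ColumnChunkMeta and all function names are hypothetical stand-ins for the 
footer metadata and the two filters described above.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ColumnChunkMeta:
    """Hypothetical stand-in for the footer metadata of one column chunk."""
    min_value: int
    max_value: int
    dictionary: Optional[Set[int]]  # None if the chunk has no dictionary page
    all_pages_dict_encoded: bool    # False if some pages use another encoding


def may_contain_by_stats(chunk: ColumnChunkMeta, value: int) -> bool:
    # Min/max check: the value can only be present inside [min, max].
    return chunk.min_value <= value <= chunk.max_value


def may_contain_by_dictionary(chunk: ColumnChunkMeta, value: int) -> bool:
    # The dictionary is only conclusive if every page is dictionary encoded;
    # otherwise some values may not appear in the dictionary at all.
    if chunk.dictionary is None or not chunk.all_pages_dict_encoded:
        return True
    return value in chunk.dictionary


def should_read_row_group(chunk: ColumnChunkMeta, value: int) -> bool:
    # A row group is skipped as soon as either check rules the value out.
    return (may_contain_by_stats(chunk, value)
            and may_contain_by_dictionary(chunk, value))
```

With min=1, max=100 and dictionary {1, 50, 100}, a lookup for 42 passes the 
stats check but is ruled out by the dictionary, provided that all pages are 
dictionary encoded.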

Also, I have a question about skipping individual pages using page stats. To 
the best of my knowledge, page stats were originally stored in page headers, 
which made page skipping less efficient than it could be because it required 
reading all page headers scattered throughout the file. I remember some 
efforts in the Parquet community to add page level statistics to the footer.

Now let's assume we have page level stats in the footer or have an efficient 
way to collect that info. Then we have a query that covers two columns. Using a 
predicate on the first column, we see that page 3 doesn't contain any relevant 
values, so we can skip the entire page for that column. However, we cannot just 
skip page 3 for the second column as the number of values within a page is not 
fixed and might vary between column chunks. Basically, there is no one-to-one 
mapping between the pages of different column chunks.
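
The mismatch can be illustrated with a small sketch. It assumes that the first 
row index of every page is known for each column chunk, which is an assumption 
on my side and not something the reader has today. The function names and the 
numbers are made up for illustration.

```python
from typing import List, Tuple


def page_row_ranges(first_rows: List[int],
                    total_rows: int) -> List[Tuple[int, int]]:
    """Return the [start, end) row range of each page in a column chunk."""
    ends = first_rows[1:] + [total_rows]
    return list(zip(first_rows, ends))


def fully_skippable_pages(first_rows: List[int],
                          total_rows: int,
                          skipped: List[Tuple[int, int]]) -> List[int]:
    """Indexes of pages whose rows all fall inside one skipped row range."""
    result = []
    for i, (start, end) in enumerate(page_row_ranges(first_rows, total_rows)):
        if any(s <= start and end <= e for s, e in skipped):
            result.append(i)
    return result


# Column A: four pages of 100 rows each; the predicate rules out the third one.
a_ranges = page_row_ranges([0, 100, 200, 300], 400)
skipped_rows = [a_ranges[2]]  # rows [200, 300)

# Column B has different page boundaries for the same 400 rows.
b_skippable = fully_skippable_pages([0, 150, 250, 350], 400, skipped_rows)
```

Here no page of column B lies entirely within rows [200, 300), so b_skippable 
is empty even though a whole page of column A can be skipped.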

My question is whether relatively efficient page skipping is possible in 
Parquet at this point.

Thanks,
Anton
