Hi Iceberg community,

I'm from Amazon and very new to the space, so please bear with me for any
naive questions.

I'm currently looking into adding NaN counts for float and double columns
(described in #348 <https://github.com/apache/iceberg/pull/348>). I noticed
that metrics like upper/lower bounds and null value counts come from the
Parquet/ORC writers themselves during writing. For Parquet,
`ParquetWriteAdapter` exposes metrics from the footer of the Parquet
library's `ParquetWriter`; for ORC, `OrcFileAppender` extracts metrics from
the ORC library's `Writer`; I don't think we have metrics for Avro content
files. While these libraries store things like null counts and min/max
(e.g. `Statistics`
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java>
in Parquet, `ColumnStatistics`
<https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/DoubleColumnStatistics.java>
in ORC), they don't keep NaN counts.

I was looking into creating a shim layer to maintain the extra NaN counter
on top of them, but it looks like in both writers the statistics updates
are tightly coupled with the writer itself (e.g. in Parquet:
`ColumnWriterBase`
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L201>,
in ORC: `TreeWriter`
<https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java#L180-L222>),
and this approach also doesn't help us with removing NaN from upper/lower
bounds.
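
For illustration only, here is a minimal sketch of what a side counter for
one double column could look like. The class name `NaNAwareDoubleMetrics`
and its methods are hypothetical and are not part of Iceberg, Parquet, or
ORC. It assumes the write path could call `add` once per non-null value,
which is the hook that seems hard to get given the coupling described above.

```java
import java.util.OptionalDouble;

// Hypothetical helper (an assumption for illustration, not an existing API):
// counts NaN values and tracks min/max over the non-NaN values only.
final class NaNAwareDoubleMetrics {
    private long valueCount = 0;
    private long nanCount = 0;
    private boolean hasNonNaN = false;
    private double min;
    private double max;

    // Assumed to be called once for every non-null double written.
    void add(double value) {
        valueCount++;
        if (Double.isNaN(value)) {
            // NaN is counted but never allowed to influence the bounds.
            nanCount++;
            return;
        }
        if (!hasNonNaN || Double.compare(value, min) < 0) {
            min = value;
        }
        if (!hasNonNaN || Double.compare(value, max) > 0) {
            max = value;
        }
        hasNonNaN = true;
    }

    long valueCount() {
        return valueCount;
    }

    long nanCount() {
        return nanCount;
    }

    // Empty when the column held no non-NaN values.
    OptionalDouble lowerBound() {
        return hasNonNaN ? OptionalDouble.of(min) : OptionalDouble.empty();
    }

    OptionalDouble upperBound() {
        return hasNonNaN ? OptionalDouble.of(max) : OptionalDouble.empty();
    }
}
```

A counter like this covers both the NaN count and NaN-free bounds, but only
if every value passes through it, so it does not by itself answer the
question of where in the Parquet/ORC write path it could be attached.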

I'd like to (1) verify that my understanding is correct, and (2) gather
suggestions on whether there is a better way than updating the Parquet/ORC
libraries to add NaN counters and new min/max stats that don't count NaN.

Thank you!
Yan
