Hi Iceberg community, I'm from Amazon and very new to the space, so please bear with me for any naive questions.
I'm currently looking into adding NaN counts for float and double columns (described in #348 <https://github.com/apache/iceberg/pull/348>).

I noticed that metrics like upper/lower bounds and null value counts come from the individual Parquet/ORC writers themselves during writing. For Parquet, `ParquetWriteAdapter` exposes metrics from the footer of the Parquet library's `ParquetWriter`; for ORC, `OrcFileAppender` extracts metrics from the ORC library's `Writer`. I don't think we have metrics for Avro content files. While the libraries store things like null counts and min/max (e.g. `Statistics` <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java> in Parquet, `ColumnStatistics` <https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/DoubleColumnStatistics.java> in ORC), they don't keep NaN counts.

I was looking into creating a shim layer to maintain the extra NaN counter on top of them, but it looks like in both writers the statistics updates are tightly coupled with the writer itself (e.g. in Parquet: `ColumnWriterBase` <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L201>, in ORC: `TreeWriter` <https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java#L180-L222>), and this approach also doesn't help us with removing NaN from the upper/lower bounds.

I'd like to (1) verify that my understanding is correct, and (2) gather suggestions on whether there is a better way than updating the Parquet/ORC libraries to add NaN counters and new min/max stats that exclude NaN.

Thank you!
Yan
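
P.S. For concreteness, here is a rough sketch of the kind of per-column bookkeeping a NaN counter plus NaN-free bounds would need. `NaNAwareDoubleMetrics` and its methods are hypothetical names for illustration only; nothing like this exists in Iceberg, Parquet, or ORC as far as I know, and the sketch assumes each value could be observed somewhere before it is handed to the library writer, which is exactly the part I am unsure about.

```java
import java.util.OptionalDouble;

/** Hypothetical per-column accumulator; not an Iceberg, Parquet, or ORC API. */
final class NaNAwareDoubleMetrics {
  private long valueCount = 0;
  private long nanCount = 0;
  private boolean hasNonNaN = false;
  private double min = Double.POSITIVE_INFINITY;
  private double max = Double.NEGATIVE_INFINITY;

  /** Records one non-null value; NaN is counted but never enters the bounds. */
  void update(double value) {
    valueCount++;
    if (Double.isNaN(value)) {
      nanCount++;
      return;
    }
    hasNonNaN = true;
    // Double.compare gives a total order, so -0.0 sorts below 0.0.
    if (Double.compare(value, min) < 0) {
      min = value;
    }
    if (Double.compare(value, max) > 0) {
      max = value;
    }
  }

  long valueCount() {
    return valueCount;
  }

  long nanCount() {
    return nanCount;
  }

  /** Empty when the column held only NaN values (or no values at all). */
  OptionalDouble lowerBound() {
    return hasNonNaN ? OptionalDouble.of(min) : OptionalDouble.empty();
  }

  /** Empty when the column held only NaN values (or no values at all). */
  OptionalDouble upperBound() {
    return hasNonNaN ? OptionalDouble.of(max) : OptionalDouble.empty();
  }
}
```

The bounds are returned as `OptionalDouble` so that a column containing only NaN reports no bounds at all instead of a misleading infinity. A float column would need the same logic with `Float.isNaN` and `Float.compare`.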
