Hi Yan, I think you’re correct about how everything works right now. Because the ORC and Parquet writers already keep statistics, Iceberg uses those instead of keeping its own. And that means that Avro doesn’t yet have stats implemented.
There’s a great start to adding stats for Avro files in PR #1060 <https://github.com/apache/iceberg/pull/1060>, but I think that the actual implementation would go a slightly different direction. That PR adds a set of classes to inspect a record and produce metrics. The problem is that this is very similar to the writers, which also need to traverse the structure of a record and write out individual leaf fields. I think it would be better to combine the two, so that all of the writers will keep track of metrics. That way, we only traverse the record structure once, and we only have one set of classes that are type-specific (a FloatWriter would also gather floating point metrics). I think the interfaces in that PR are pretty good, we just need to implement them in the writer tree. For Parquet NaN metrics, I would bring in those interfaces, implement them in the Parquet writer classes to accumulate NaN metrics, and then merge the NaN metrics with the file-based metrics from the file footer. We’d do the same thing for ORC, and then that sets up the changes to add metrics for Avro, but we’d need to collect all of the metrics for Avro instead of merging with file metrics. Does that help? rb On Fri, Oct 16, 2020 at 7:00 PM Yan Yan <[email protected]> wrote: > Hi Iceberg community, > > I'm from Amazon and very new to the space, so please bear with me for any > naive questions. > > I'm currently looking into adding NaN counts for float and double columns > (described in #348 <https://github.com/apache/iceberg/pull/348>). I > noticed that metrics like upper/lower bounds and null value counts come > from the individual Parquet/Orc writer themselves during writing (e.g. for > Parquet, `*ParquetWriteAdapter*` exposes metrics from the footer of > parquet library's `*ParquetWriter*`; for ORC, `*OrcFileAppender*` > extracts metrics from ORC library's `*Writer*`; I don't think we have > metrics for avro content files), and while they store things like null > counts and min/max (e.g. `Statistics > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java>` > in Parquet, `ColumnStatistics > <https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/DoubleColumnStatistics.java>` > in ORC), they don't keep NaN counts. > > I was looking into creating a shim layer to maintain the extra NaN counter > on top of them, but it looks like in both writers the statistics updates > are tightly coupled with the writer itself (e.g. in Parquet: ` > ColumnWriterBase > <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L201>`, > in Orc: `TreeWriter > <https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java#L180-L222>`), > and this approach also doesn't help us with removing NaN from upper/lower > bounds. > > I'd like to (1) verify my understanding is correct, and (2) gather > suggestions on if there is a better way than updating the parquet/orc > library to add NaN counters and new min/max stats that don't count NaN. > > Thank you! > Yan > -- Ryan Blue Software Engineer Netflix
