Hi Yan,

I think you’re correct about how everything works right now. Because the
ORC and Parquet writers already keep statistics, Iceberg uses those instead
of keeping its own. And that means that Avro doesn’t yet have stats
implemented.

There’s a great start to adding stats for Avro files in PR #1060
<https://github.com/apache/iceberg/pull/1060>, but I think that the actual
implementation would go a slightly different direction. That PR adds a set
of classes to inspect a record and produce metrics. The problem is that
this is very similar to the writers, which also need to traverse the
structure of a record and write out individual leaf fields. I think it
would be better to combine the two, so that all of the writers will keep
track of metrics. That way, we only traverse the record structure once, and
we only have one set of classes that are type-specific (a FloatWriter would
also gather floating point metrics).

I think the interfaces in that PR are pretty good, we just need to
implement them in the writer tree. For Parquet NaN metrics, I would bring
in those interfaces, implement them in the Parquet writer classes to
accumulate NaN metrics, and then merge the NaN metrics with the file-based
metrics from the file footer. We’d do the same thing for ORC, and then that
sets up the changes to add metrics for Avro, but we’d need to collect all
of the metrics for Avro instead of merging with file metrics.

Does that help?

rb

On Fri, Oct 16, 2020 at 7:00 PM Yan Yan <[email protected]> wrote:

> Hi Iceberg community,
>
> I'm from Amazon and very new to the space, so please bear with me for any
> naive questions.
>
> I'm currently looking into adding NaN counts for float and double columns
> (described in #348 <https://github.com/apache/iceberg/pull/348>). I
> noticed that metrics like upper/lower bounds and null value counts come
> from the individual Parquet/Orc writer themselves during writing (e.g. for
> Parquet, `*ParquetWriteAdapter*` exposes metrics from the footer of
> parquet library's `*ParquetWriter*`; for ORC, `*OrcFileAppender*`
> extracts metrics from ORC library's `*Writer*`; I don't think we have
> metrics for avro content files), and while they store things like null
> counts and min/max (e.g. `Statistics
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java>`
> in Parquet, `ColumnStatistics
> <https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/DoubleColumnStatistics.java>`
> in ORC), they don't keep NaN counts.
>
> I was looking into creating a shim layer to maintain the extra NaN counter
> on top of them, but it looks like in both writers the statistics updates
> are tightly coupled with the writer itself (e.g. in Parquet: `
> ColumnWriterBase
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L201>`,
> in Orc: `TreeWriter
> <https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java#L180-L222>`),
> and this approach also doesn't help us with removing NaN from upper/lower
> bounds.
>
> I'd like to (1) verify my understanding is correct, and (2) gather
> suggestions on if there is a better way than updating the parquet/orc
> library to add NaN counters and new min/max stats that don't count NaN.
>
> Thank you!
> Yan
>


-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to