SAY-5 opened a new pull request, #807: URL: https://github.com/apache/arrow-go/pull/807
### Rationale for this change Fixes #806. The `New*Statistics` factories in `parquet/metadata/statistics_types.gen.go` initialize `hasDistinctCount: true` at construction time, so every fresh stat object reports a distinct count of 0. `Encode()` then unconditionally calls `enc.SetDistinctCount(0)`, which makes the writer emit `distinct_count: 0` in both `ColumnChunk.metadata.statistics` and `DataPageHeader.statistics` for every column — even when no distinct count has been computed. The Parquet thrift `Statistics.distinct_count` field is OPTIONAL. Per spec semantics, it must be absent when unknown. Emitting 0 conflates "no distinct values" with "we don't know" and causes byte-asymmetry against parquet-cpp (pyarrow), which leaves the field unset for the same input. `IncDistinct()` is the only call site that should flip `hasDistinctCount` to true, and grepping the repo confirms it is never invoked from the writer paths — so flipping the default is safe. ### What changes are included in this PR? - Remove `hasDistinctCount: true` from the factory initializer in `statistics_types.gen.go.tmpl` and regenerate `statistics_types.gen.go`. `hasNullCount: true` is kept as-is (its semantics differ: `IncNulls` is called per page and `null_count: 0` is meaningful for a writer that observes zero nulls). - Add `TestNewStatisticsDistinctCountUnset` covering Int32/Int64/Float32/Float64/Boolean/ByteArray, asserting both `HasDistinctCount()` and `Encode().HasDistinctCount` start false and flip to true after `IncDistinct`. ### Are these changes tested? Yes. The new test fails on master (HasDistinctCount returns true on a fresh stat object) and passes after the fix. `go test ./parquet/metadata/...` is green. ### Are there any user-facing changes? Yes — writers that previously emitted `distinct_count: 0` for every column will now omit the field, matching the spec and parquet-cpp. Readers already tolerate both, so this is a wire-format normalization, not a breaking change for downstream consumers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
