SAY-5 opened a new pull request, #807:
URL: https://github.com/apache/arrow-go/pull/807

   ### Rationale for this change
   
   Fixes #806. The `New*Statistics` factories in 
`parquet/metadata/statistics_types.gen.go` initialize `hasDistinctCount: true` 
at construction time, so every fresh stat object reports a distinct count of 0. 
`Encode()` then unconditionally calls `enc.SetDistinctCount(0)`, which makes 
the writer emit `distinct_count: 0` in both `ColumnChunk.metadata.statistics` 
and `DataPageHeader.statistics` for every column — even when no distinct count 
has been computed.
   
   The Parquet thrift `Statistics.distinct_count` field is OPTIONAL. Per spec 
semantics, it must be absent when unknown. Emitting 0 conflates "no distinct 
values" with "we don't know" and causes byte-asymmetry against parquet-cpp 
(pyarrow), which leaves the field unset for the same input.
   
   `IncDistinct()` is the only call site that should flip `hasDistinctCount` to 
true, and grepping the repo confirms it is never invoked from the writer paths 
— so flipping the default is safe.
   
   ### What changes are included in this PR?
   
   - Remove `hasDistinctCount: true` from the factory initializer in 
`statistics_types.gen.go.tmpl` and regenerate `statistics_types.gen.go`. 
`hasNullCount: true` is kept as-is (its semantics differ: `IncNulls` is called 
per page and `null_count: 0` is meaningful for a writer that observes zero 
nulls).
   - Add `TestNewStatisticsDistinctCountUnset` covering 
Int32/Int64/Float32/Float64/Boolean/ByteArray, asserting both 
`HasDistinctCount()` and `Encode().HasDistinctCount` start false and flip to 
true after `IncDistinct`.
   
   ### Are these changes tested?
   
   Yes. The new test fails on master (HasDistinctCount returns true on a fresh 
stat object) and passes after the fix. `go test ./parquet/metadata/...` is 
green.
   
   ### Are there any user-facing changes?
   
   Yes — writers that previously emitted `distinct_count: 0` for every column 
will now omit the field, matching the spec and parquet-cpp. Readers already 
tolerate both, so this is a wire-format normalization, not a breaking change 
for downstream consumers.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to