mapleFU commented on code in PR #34054:
URL: https://github.com/apache/arrow/pull/34054#discussion_r1118029417
##########
cpp/src/parquet/statistics.cc:
##########
@@ -494,6 +494,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
int64_t null_count, int64_t distinct_count, bool has_min_max,
bool has_null_count, bool has_distinct_count,
MemoryPool* pool)
: TypedStatisticsImpl(descr, pool) {
+ has_null_count_ = has_null_count;
+ has_distinct_count_ = has_distinct_count;
Review Comment:
I ran into the same problem here. I think the semantics of "has_xxx" are as
follows.
For a writer:
* The writer can guarantee a correct null count (assuming it has no bugs).
* As far as I can tell, ndv is currently never collected. If a user collects
ndv in page 1 but not in page 2, it should be discarded.
For a reader:
* When deserializing, the reader should assume that ndv and null_count may be
unset (currently it does not behave this way).
* Deserialized statistics can be merged, but if either `null_count` or
`ndv` is unset, that count should be discarded.
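
To make the merge rule above concrete, here is a minimal sketch. `CountStats` and `MergeCounts` are hypothetical names used only for illustration and are not part of the actual `parquet::Statistics` API. The sketch assumes a count remains valid after a merge only when both sides have it set, and it always drops the distinct count, since an exact merged ndv cannot be derived from two per-page counts.

```cpp
#include <cstdint>

// Hypothetical stand-in for the count fields of a statistics object.
// This is NOT the real parquet::Statistics class.
struct CountStats {
  int64_t null_count = 0;
  int64_t distinct_count = 0;
  bool has_null_count = false;
  bool has_distinct_count = false;
};

// Merge `other` into `self`.
inline void MergeCounts(CountStats& self, const CountStats& other) {
  if (self.has_null_count && other.has_null_count) {
    // Null counts are additive across pages.
    self.null_count += other.null_count;
  } else {
    // One side never recorded a null count, so the total is unknown.
    self.has_null_count = false;
    self.null_count = 0;
  }
  // Assumption of this sketch: the merged ndv is always discarded, because
  // two distinct counts cannot be combined into an exact distinct count.
  self.has_distinct_count = false;
  self.distinct_count = 0;
}
```

With this rule, merging two pages that both report a null count yields their sum, while merging with a page whose `has_null_count` is false marks the result as unset instead of reporting a misleading value.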