mapleFU commented on issue #38877: URL: https://github.com/apache/arrow/issues/38877#issuecomment-1825878054
Some notes:

1. Parquet uses `Statistics` [1] to store `distinct_count`, which is an optional field in the Thrift definition. `Statistics` can occur in `PageHeader` and `ColumnChunkMetadata`. I think it's a bit hard to maintain `distinct_count` in `PageHeader`, so I think it only makes sense to store a "ColumnChunk"-level distinct count.
2. As for "across multiple ColumnChunkMetadata": the `Statistics` only apply to a single column chunk. We **cannot** regard it as a whole-file distinct count.
3. We may need to survey how other implementations handle `distinct_count` during writing.

As I said about `DictEncoder`, if the user chooses dictionary encoding, it will have a `Dictionary` for non-null values. So, after writing a ColumnChunk, it's ok to get the `distinct_count` from the dictionary. For other encoders, we currently don't maintain a dictionary, so it's impossible to get a `distinct_count` here.

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L244
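To illustrate the two points above (a dictionary gives a per-chunk `distinct_count` for free, and per-chunk counts cannot be combined into a whole-file count), here is a minimal Python sketch. `ToyDictEncoder` and `encode_chunk` are hypothetical names made up for this example; they are not the Arrow or Parquet API, and the sketch assumes one fresh dictionary per column chunk.

```python
class ToyDictEncoder:
    """Hypothetical stand-in for a dictionary encoder (not the Arrow API)."""

    def __init__(self):
        # Maps each distinct non-null value to its dictionary index.
        self._dictionary = {}
        # Dictionary indices for the non-null values, in write order.
        self.indices = []

    def put(self, value):
        # Nulls are not stored in the dictionary, so they are skipped here.
        if value is None:
            return
        index = self._dictionary.setdefault(value, len(self._dictionary))
        self.indices.append(index)

    def distinct_count(self):
        # After the chunk is written, the dictionary size is the number of
        # distinct non-null values in that chunk.
        return len(self._dictionary)


def encode_chunk(values):
    """Encode one column chunk with its own dictionary (an assumption)."""
    encoder = ToyDictEncoder()
    for value in values:
        encoder.put(value)
    return encoder


chunk_a = encode_chunk(["x", "y", None, "x"])
chunk_b = encode_chunk(["y", "z", "z"])

per_chunk_counts = [chunk_a.distinct_count(), chunk_b.distinct_count()]

# Ground truth for the whole "file", computed directly from the raw values.
whole_file_count = len({"x", "y", "z"})

print(per_chunk_counts)       # chunk-level distinct counts
print(sum(per_chunk_counts))  # over-counts "y", which appears in both chunks
print(whole_file_count)       # the actual whole-file distinct count
```

Because `"y"` appears in both chunks, summing the chunk-level counts over-counts it. The chunk-level values only bound the whole-file count between the largest single chunk count and their sum.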
