mapleFU commented on issue #38877:
URL: https://github.com/apache/arrow/issues/38877#issuecomment-1825878054

   Some notes:
   
   1. Parquet uses `Statistics` [1] to store the `distinct_count`, which is an 
optional field in the Thrift definition. `Statistics` can occur in `PageHeader` 
and `ColumnChunkMetadata`. I think it's a bit hard to maintain `distinct_count` 
in `PageHeader`, so I think it only makes sense to store a ColumnChunk-level 
distinct count.
   2. As for "across multiple ColumnChunkMetadata": the `Statistics` only 
describe a single column chunk. We **cannot** treat it as a whole-file 
distinct count.
   3. We may need to survey how other implementations handle 
`distinct_count` during writing.
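
   To illustrate point 2, here is a minimal sketch in plain Python (not the 
Parquet API). The chunk contents and the `distinct_count` helper are made up 
for illustration, and it assumes nulls are excluded from the count. It shows 
that per-chunk counts only bound the whole-column count and cannot simply be 
summed:

```python
# Hypothetical values of one column, split across two column chunks.
chunk_a = ["x", "y", "y", None]
chunk_b = ["y", "z", None]


def distinct_count(values):
    """Number of distinct non-null values (assumption: nulls are excluded)."""
    return len({v for v in values if v is not None})


per_chunk = [distinct_count(chunk_a), distinct_count(chunk_b)]
whole_column = distinct_count(chunk_a + chunk_b)

# "y" appears in both chunks, so the per-chunk counts overlap.
# They only give bounds on the whole-column distinct count.
lower_bound = max(per_chunk)
upper_bound = sum(per_chunk)
assert lower_bound <= whole_column <= upper_bound
```

   Here the two chunks each report 2 distinct values, while the column as a 
whole has 3, so neither the sum nor the maximum recovers the true value.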
   
   As I said about `DictEncoder`: if the user chooses dictionary encoding, the 
encoder keeps a `Dictionary` of the non-null values. So, after writing a 
ColumnChunk, it's OK to get the `distinct_count` from the dictionary. For other 
encoders, we currently don't maintain a dictionary, so it's just impossible to 
get a `distinct_count` here.
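
   A rough sketch of the idea, in plain Python rather than the Arrow C++ API. 
`ToyDictEncoder`, `put`, and `num_entries` are hypothetical names, not the real 
`DictEncoder` interface, and the sketch assumes the writer stays on dictionary 
encoding for the whole column chunk:

```python
class ToyDictEncoder:
    """Illustrative stand-in for a dictionary encoder (hypothetical API)."""

    def __init__(self):
        self._index_of = {}  # value -> dictionary index
        self.indices = []    # encoded indices, one per non-null value

    def put(self, values):
        for v in values:
            if v is None:
                continue  # nulls never enter the dictionary
            # Reuse the existing index, or assign the next free one.
            idx = self._index_of.setdefault(v, len(self._index_of))
            self.indices.append(idx)

    def num_entries(self):
        """Dictionary size, i.e. distinct non-null values seen so far."""
        return len(self._index_of)


encoder = ToyDictEncoder()
encoder.put(["x", "y", "y", None, "x"])

# After the chunk is written, the dictionary size is the distinct count.
chunk_distinct_count = encoder.num_entries()
```

   An encoder that writes values directly keeps no such table, which is why 
the count is not available there without extra bookkeeping.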
   
   [1] 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L244

