alamb opened a new issue, #8608: URL: https://github.com/apache/arrow-rs/issues/8608
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

@JanKaul reports via Discord:

> In the parquet metadata for column chunks is a field for distinct_counts, it is currently not populated or maybe just for dictionary columns. Distinct count statistics play an important role for join order selection, so it would be very good to provide that to a query engine. However, calculating distinct counts is very expensive and probably the reason why it is not done for most parquet writers.

The `distinct_count` field is defined here: https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L292-L293

The reason this is not written at this time is that computing distinct counts for columns can be quite expensive depending on the data type (e.g. the writer may have to keep track of every value it has seen, etc.)

**Describe the solution you'd like**

Allow the Rust parquet writers to populate this field somehow.

**Describe alternatives you've considered**

1. Implement some basic implementation in the writer that is optional (and off by default). Careful memory management (and reporting / limiting) is probably critical
2. Implement an API / callback for populating the statistics, i.e. require user code to manage the gathering / fallback

I would suggest personally:

1. Built-in distinct statistics that can be enabled per column (as distinct counts are much more important for some columns)
2. Add a memory limit for computing the distinct count, and if that limit is exceeded, stop capturing statistics and write the data without the stats
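To make the "memory limit, then give up" suggestion concrete, here is a minimal sketch of what a bounded distinct counter could look like. It is an illustration only: `BoundedDistinctCounter`, `observe`, `distinct_count`, and the `ENTRY_OVERHEAD` constant are hypothetical names and are not part of the `parquet` crate, and the memory accounting is a rough estimate rather than a real measurement. It uses an exact `HashSet`; an approximate sketch structure would be a different design.

```rust
use std::collections::HashSet;

/// Rough per-entry bookkeeping cost in bytes (an assumed constant for
/// illustration, not a measured value).
const ENTRY_OVERHEAD: usize = 24;

/// Hypothetical helper: tracks exact distinct values for one column until
/// an estimated memory budget is exceeded, then stops tracking.
pub struct BoundedDistinctCounter {
    /// `Some` while tracking; `None` once the budget has been exceeded.
    seen: Option<HashSet<Vec<u8>>>,
    bytes_used: usize,
    byte_limit: usize,
}

impl BoundedDistinctCounter {
    pub fn new(byte_limit: usize) -> Self {
        Self {
            seen: Some(HashSet::new()),
            bytes_used: 0,
            byte_limit,
        }
    }

    /// Record one value (as raw bytes). Does nothing after the counter
    /// has given up.
    pub fn observe(&mut self, value: &[u8]) {
        let over_budget = match self.seen.as_mut() {
            None => return,
            Some(seen) => {
                if seen.contains(value) {
                    return;
                }
                let cost = value.len() + ENTRY_OVERHEAD;
                if self.bytes_used + cost > self.byte_limit {
                    true
                } else {
                    seen.insert(value.to_vec());
                    self.bytes_used += cost;
                    false
                }
            }
        };
        if over_budget {
            // Drop the set and release its memory; no count will be reported.
            self.seen = None;
            self.bytes_used = 0;
        }
    }

    /// `Some(n)` if every value was tracked, `None` if the budget was
    /// exceeded and the statistic should be omitted.
    pub fn distinct_count(&self) -> Option<u64> {
        self.seen.as_ref().map(|s| s.len() as u64)
    }
}
```

A writer could feed each column value into `observe` and, when closing the column chunk, write `distinct_count` only if the result is `Some`. Returning `None` after overflow matches the proposal to write the data without the stats rather than report a partial count.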
