etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1303204476
##########
src/main/thrift/parquet.thrift:
##########
@@ -974,6 +1050,13 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+   * Repetition and definition level histograms for the pages.
+   *
+   * This contains some redundancy with null_counts, however, to accommodate the
+   * widest range of readers both should be populated.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms;

Review Comment:
@gszadovszky

> I think the size of a Parquet file in total and also in the different parts (footer, column/offset indexes, pages) impact the related systems significantly. We should be very careful about adding new metadata since they are usually useful if they are written by all the implementations increasing the size of the average Parquet file.

I am in total agreement.

> I am not against adding new structures to the format that is not potentially useful for everyone, but in this case we should state that the writing of these structures are optional (not only syntactically but semantically as well).

Yes, this should be optional. In fact, it would only be non-zero for byte array columns, so the size impact should not be huge compared to adding the histograms. Or is the argument to not add the histograms to the ColumnIndex?

> do you want to scan Parquet files written by any kinds of writers or you have control over the writing?

While I have control over the writing, many different implementations may be involved depending on what we're doing. We might use libcudf for some of the ETL and data grooming, spark/parquet-mr for querying, arrow-rs or polars for bulk import, etc. It would be nice if all implementations converged one day on the same set of functionality.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.