JFinis commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318275562
########## src/main/thrift/parquet.thrift ##########

```diff
@@ -977,6 +1073,15 @@ struct ColumnIndex {
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+   * Repetition and definition level histograms for the pages.
+   *
+   * This contains some redundancy with null_counts, however, to accommodate
+   * the widest range of readers both should be populated when either the max
+   * definition and repetition level meet the requirements specified in
+   * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms
```

Review Comment:

> For an individual file, I'm not sure the added complexity is worth the savings.

Sidenote: I maintain a proprietary implementation, and I honestly think the savings would be worth it for our use case. And our use case is not super special (we use Iceberg to store insane amounts of data, both in the number of rows and in the number of individual tables); I suspect many others have similar use cases, so the savings would be worth it for many. Let me explain why I think this is the case.

Not only does a list-of-lists design take up more space; thrift-decoding a flat list of integers is also much faster than thrift-decoding a list of structs of lists. For example, this is the C++ code for decoding the `null_counts` of the column index (I used a proprietary thrift compiler that generates slightly different code than the default implementation, but it is effectively equivalent):

```
// Decode `5: optional list<i64> null_counts` into one contiguous span.
xfer += iprot->readListBegin(type, size);
this->null_counts = container.allocateSpan<int64_t>(size);
for (uint32_t i = 0; i < size; ++i) {
  xfer += iprot->readI64(this->null_counts[i]);
}
xfer += iprot->readListEnd();
```

So the hot loop is as efficient as it gets: it just reads I64s.
With structs and nested lists, you have to:

* allocate lists and structs,
* read and check the isset flags for all nested optional fields (both lists are declared optional), and
* recurse between the different read implementations.

You could argue that this is still negligible time. But you run this code for each row group, possibly for multiple columns per row group, so it can easily execute 100,000 times when processing larger tables. Especially with formats like Delta Lake and Iceberg, files are often smaller than we would wish (e.g., a small insert generates a small file for just that insert). Thus, we sometimes have row groups with just hundreds of rows in them. For such row groups, thrift-decoding the metadata can actually become slower than reading the actual rows in some cases. So thrift-decoding performance does matter, at least in my application.

> If one wishes to cache the indexes to speed up queries, the additional memory could be a concern

It just so happens that my application does this :). So yes, the more memory we waste, the fewer indexes we can keep in memory and the fewer cache hits we get, which would be a shame if it was preventable. Yes, we could cache the thrift-encoded data instead, but as laid out above, thrift-decoding also becomes more costly with a list-of-lists design. A ColumnIndex is used, for example, for a point access (an equality predicate on a sorted, partitioned table), and in that case decoding speed matters: users expect a very quick answer for a query that returns a single row.