etseidl commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700406992
Since we all seem to be in agreement now, it's probably good to list the options available and then decide which to use. My (probably incomplete) list would be:

1. Simply add `SizeStatistics` to `ColumnIndex`. This is the simplest solution, keeps the new data together, and mirrors what is being added to `ColumnMetaData`. The downside is extra storage and work for clients that may not use this new information.

2. Add `RepetitionDefinitionLevelHistogram` to `ColumnIndex` and `unencoded_variable_width_stored_bytes` to `OffsetIndex` (either as an optional field in `PageLocation`, or as an optional `list<i64>` in `OffsetIndex`). This is the next simplest to implement, and has modest savings over option 1. It suffers from the same drawback: clients are forced to read this extra information.

3. Add a size/location pair to `ColumnMetaData` and a new struct containing `list<SizeStatistics>`, mirroring how `OffsetIndex` is written. This allows clients that have no need for this information to ignore it, and gives clients that don't need the full column/offset indexes access to just the size information, but it adds complexity and requires reading a third structure for those clients that will use all three.

I think 3 is maybe the most flexible, but since I'd almost always be using all three structures anyway, I'd likely vote for 1 or 2. If forced to pick, I'd probably take 1 right now since I already have it implemented :) I do have the cycles to try out 2 and 3 and can report back if that would be helpful.
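To make the trade-off between the three options a bit more concrete, here is a rough Python sketch of the layouts (plain dataclasses, not Thrift, and not the definitions in the PR). The field shapes are assumptions for illustration: the per-page `list` in option 1, the histogram fields, and the names `SizeIndex`, `size_index_offset`, `size_index_length`, and `structures_to_read` are all hypothetical. Only `SizeStatistics`, `RepetitionDefinitionLevelHistogram`, `unencoded_variable_width_stored_bytes`, `ColumnIndex`, `OffsetIndex`, `PageLocation`, and `ColumnMetaData` come from the discussion above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RepetitionDefinitionLevelHistogram:
    # Assumed shape: one count per level value.
    repetition_levels: List[int] = field(default_factory=list)
    definition_levels: List[int] = field(default_factory=list)


@dataclass
class SizeStatistics:
    unencoded_variable_width_stored_bytes: Optional[int] = None
    histogram: Optional[RepetitionDefinitionLevelHistogram] = None


# Option 1: everything goes into ColumnIndex (assumed: one entry per page).
@dataclass
class ColumnIndexOption1:
    size_statistics: Optional[List[SizeStatistics]] = None


# Option 2: histograms in ColumnIndex, byte counts in OffsetIndex
# (shown here as the optional list<i64> variant).
@dataclass
class ColumnIndexOption2:
    histograms: Optional[List[RepetitionDefinitionLevelHistogram]] = None


@dataclass
class OffsetIndexOption2:
    unencoded_variable_width_stored_bytes: Optional[List[int]] = None


# Option 3: a separate structure, located via ColumnMetaData.
@dataclass
class SizeIndex:  # hypothetical name for the new struct
    size_statistics: List[SizeStatistics] = field(default_factory=list)


@dataclass
class ColumnMetaDataOption3:
    size_index_offset: Optional[int] = None  # hypothetical field
    size_index_length: Optional[int] = None  # hypothetical field


def structures_to_read(option: int, wants_sizes: bool, wants_indexes: bool) -> List[str]:
    """Return which page-level structures a client has to read.

    `wants_indexes` means the client uses the existing column/offset indexes;
    `wants_sizes` means it uses the new size information.
    """
    needed: List[str] = []
    if option == 1:
        if wants_sizes or wants_indexes:
            needed.append("ColumnIndex")
        if wants_indexes:
            needed.append("OffsetIndex")
    elif option == 2:
        if wants_sizes or wants_indexes:
            needed += ["ColumnIndex", "OffsetIndex"]
    elif option == 3:
        if wants_indexes:
            needed += ["ColumnIndex", "OffsetIndex"]
        if wants_sizes:
            needed.append("SizeIndex")
    else:
        raise ValueError(f"unknown option: {option}")
    return needed
```

Under these assumptions, a client that only wants size information reads one structure with options 1 and 3 and two with option 2, while a client that wants everything reads two structures with options 1 and 2 and three with option 3. With options 1 and 2, a client that only wants the existing indexes still has to deserialize the new fields along with them.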