etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700406992

   Since we all seem to be in agreement now, it's probably good to list the 
options available and then make a decision on which to use. My (probably 
incomplete) list would be:
   
   1. Simply add `SizeStatistics` to `ColumnIndex`. This is the simplest 
solution, keeps the new data together, and mirrors what is being added to 
`ColumnMetaData`. The downside is extra storage and work for clients that may 
not use this new information.
   2. Add `RepetitionDefinitionLevelHistogram` to `ColumnIndex` and 
`unencoded_variable_width_stored_bytes` to `OffsetIndex` (either by adding it 
as an optional field in the `PageLocation`, or as an optional `list<i64>` in 
`OffsetIndex`). This is the next simplest to implement, and has modest savings 
over option 1. It shares option 1's drawback: clients are forced to read 
this extra information whether or not they use it.
   3. Add a size/location pair to `ColumnMetaData` and a new struct containing 
`list<SizeStatistics>`, mirroring how `OffsetIndex` is written. This lets 
clients that have no need for this information ignore it, and lets clients 
that don't need the full column/offset indexes read just the size 
information, but it adds complexity and requires reading a third structure for 
those clients that will use all three.
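
   To make the trade-off between the three options concrete, here is a rough 
Python sketch (not actual Thrift, and not code from the PR) of which 
structures a reader would have to fetch under each one. The name `SizeIndex` 
for the new struct in option 3 and the helper `structures_to_read` are 
hypothetical, made up purely for illustration.

```python
def structures_to_read(option: int, wants_page_index: bool, wants_sizes: bool) -> set:
    """Return the page-level structures a client must fetch.

    option           -- 1, 2 or 3, matching the list above
    wants_page_index -- client uses the existing ColumnIndex/OffsetIndex data
    wants_sizes      -- client uses the new per-page size information
    """
    if option not in (1, 2, 3):
        raise ValueError("option must be 1, 2 or 3")

    needed = set()
    if wants_page_index:
        # Under options 1 and 2 these reads also pull in the size data,
        # whether or not the client wants it.
        needed |= {"ColumnIndex", "OffsetIndex"}
    if wants_sizes:
        if option == 1:
            # SizeStatistics lives inside ColumnIndex.
            needed.add("ColumnIndex")
        elif option == 2:
            # Histogram in ColumnIndex, byte counts in OffsetIndex.
            needed |= {"ColumnIndex", "OffsetIndex"}
        else:
            # Separate struct located via a size/location pair in
            # ColumnMetaData ("SizeIndex" is a hypothetical name).
            needed.add("SizeIndex")
    return needed


# A client that only wants sizes reads one small structure under option 3...
print(structures_to_read(3, wants_page_index=False, wants_sizes=True))
# ...but a client that wants everything reads three structures instead of two.
print(len(structures_to_read(3, wants_page_index=True, wants_sizes=True)))
```

   The sketch only counts structures; it says nothing about how many bytes 
each option adds, which would need to be measured.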
   
   I think 3 is maybe the most flexible, but since I'd almost always be using 
all three structures anyway, I'd likely vote for 1 or 2. If forced to pick, I'd 
probably take 1 right now since I already have it implemented :) I do have the 
cycles to try out 2 and 3 and can report back if that would be helpful.
   


