Kontinuation commented on code in PR #240:
URL: https://github.com/apache/parquet-format/pull/240#discussion_r1801562506
##########
src/main/thrift/parquet.thrift:
##########
@@ -1084,6 +1290,9 @@ struct ColumnIndex {
* Same as repetition_level_histograms except for definitions levels.
**/
7: optional list<i64> definition_level_histograms;
+
+ /** A list containing statistics of GEOMETRY logical type for each page */
+ 8: optional list<GeometryStatistics> geometry_stats;
Review Comment:
> Do you have any good benchmark data to create some testing Parquet files
> so we can determine if it is worth adding the column index?
I have experimented with several datasets downloaded from [UCR
STAR](https://star.cs.ucr.edu/), including the NYC Taxi, PortoTaxi, Tiger Roads,
and MS Buildings datasets. The Parquet data pages are about 1MB each. Storing
geometry stats in the column index adds 0.004% ~ 0.01% of the data size, which
is quite close to the theoretical overhead of 64 bytes per page.
Relative to the column index itself, however, the storage overhead is
significant: adding the geometry statistics field makes the column index 91%
larger.
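
As a rough sanity check of the percentages above, here is a minimal Python sketch of the arithmetic. It assumes "about 1MB" means 1 MiB and takes the 64 bytes per page at face value; it ignores Thrift encoding overhead and compression, and the function names are only illustrative.

```python
# Back-of-the-envelope check of the per-page overhead quoted above.
# Assumptions (not measured here): a page is exactly 1 MiB and the
# geometry stats cost exactly 64 bytes per page.

STATS_BYTES_PER_PAGE = 64
PAGE_SIZE_BYTES = 1024 * 1024


def overhead_percent(stats_bytes: float, page_bytes: float) -> float:
    """Stats size as a percentage of the page data size."""
    return 100.0 * stats_bytes / page_bytes


def implied_page_bytes(stats_bytes: float, percent: float) -> float:
    """Page size that would produce the given overhead percentage."""
    return stats_bytes / (percent / 100.0)


theoretical = overhead_percent(STATS_BYTES_PER_PAGE, PAGE_SIZE_BYTES)
print(f"theoretical overhead: {theoretical:.4f}%")

# Effective page sizes implied by the two ends of the measured range.
for measured in (0.004, 0.01):
    size = implied_page_bytes(STATS_BYTES_PER_PAGE, measured)
    print(f"{measured}% -> about {size / 1e6:.2f} MB per page")
```

Under these assumptions the theoretical figure lands inside the measured 0.004% ~ 0.01% range, and the two ends of that range correspond to effective page sizes of roughly 1.6 MB and 0.64 MB.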
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]