Re: [PR] PARQUET-2471: Add geometry logical type [parquet-format]

via GitHub Tue, 15 Oct 2024 09:43:33 -0700


Kontinuation commented on code in PR #240:
URL: https://github.com/apache/parquet-format/pull/240#discussion_r1801562506



##########
src/main/thrift/parquet.thrift:
##########
@@ -1084,6 +1290,9 @@ struct ColumnIndex {
     * Same as repetition_level_histograms except for definitions levels.
     **/
    7: optional list<i64> definition_level_histograms;
+
+   /** A list containing statistics of GEOMETRY logical type for each page */
+   8: optional list<GeometryStatistics> geometry_stats;

Review Comment:
   > Do you have any good benchmark data to create some testing Parquet files 
so we can determine if it is worth adding the column index?
   
   I have experimented with several datasets downloaded from [UCR 
STAR](https://star.cs.ucr.edu/), including the NYC Taxi, PortoTaxi, Tiger Roads, 
and MS Buildings datasets. The Parquet data pages are about 1MB each. The extra 
storage overhead of storing geometry stats in the column index is 0.004% to 
0.01%, which is quite close to the theoretical overhead of 64 bytes per page.
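
   As a rough sanity check on the theoretical figure, the arithmetic can be 
sketched in Python. The exact page size of 1 MiB is an assumption (the pages are 
only "about 1MB"), and `overhead_percent` is a hypothetical helper for 
illustration, not part of any Parquet API.

```python
# Assumption: "about 1MB" pages are treated as exactly 1 MiB.
PAGE_SIZE_BYTES = 1024 * 1024
# Theoretical per-page overhead quoted in this comment.
STATS_BYTES_PER_PAGE = 64


def overhead_percent(stats_bytes: int, page_bytes: int) -> float:
    """Return the per-page statistics size as a percentage of the page size."""
    return 100.0 * stats_bytes / page_bytes


theoretical = overhead_percent(STATS_BYTES_PER_PAGE, PAGE_SIZE_BYTES)
print(f"{theoretical:.4f}%")  # prints 0.0061%
```

   Under these assumptions the theoretical value is about 0.006%, which falls 
inside the measured 0.004% to 0.01% range.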
   
   Relative to the column index itself, however, the storage overhead is 
significant: adding the geometry statistics field makes the column index 91% 
larger.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

