[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760674#comment-17760674 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

wgtmac commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700283600

> As far as performance goes, writing the indexes took 100s of microseconds vs total write times in the seconds 😄 Actually generating the histograms was a larger impact than writing them.

Do you have the time spent on collecting the histograms? And what about the average number of records per page and the total number of records in the file? I ask because the number of pages can significantly affect the page index size.

@etseidl From the above result, I am not so worried about the increase in the column index size. IMHO, though the initial design goal of the page index is mainly page filtering, the OffsetIndex can be used on its own for better I/O planning of pages instead of blindly reading them in sequence. Therefore I do not object to adding `SizeStatistics` to the ColumnIndex. The downside is that people who do not need this info have to pay for the I/O and Thrift deserialization of the SizeStatistics.

> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)