[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760674#comment-17760674
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

wgtmac commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700283600

   > As far as performance goes, writing the indexes took 100s of microseconds 
vs total write times in the seconds 😄 Actually generating the histograms had a 
larger impact than writing them.
   
   Do you have the time spent on collecting the histograms? And what about the 
average number of records per page and the total number of records in the file? 
I ask because the number of pages can significantly affect the page index 
size. @etseidl 
   
   Given the result above, I am not too worried about the increase in the 
column index size. IMHO, although the page index was originally designed mainly 
for page filtering, the OffsetIndex can be used on its own to plan page I/O 
better, instead of blindly reading the pages in sequence. Therefore I do not 
object to adding `SizeStatistics` to the ColumnIndex. The downside is that 
people who do not need this information still have to pay for the I/O and 
Thrift deserialization of the SizeStatistics.
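
To illustrate the I/O planning point, here is a minimal sketch, not taken from any Parquet implementation. It assumes a hypothetical `PageLocation` tuple that mirrors the `offset`, `compressed_page_size`, and `first_row_index` fields of the OffsetIndex entries; `plan_reads` and `max_gap` are illustrative names only. It coalesces nearby pages into fewer, larger byte-range reads:

```python
from typing import List, NamedTuple, Tuple


class PageLocation(NamedTuple):
    # Hypothetical stand-in for one OffsetIndex entry.
    offset: int                # byte offset of the page in the file
    compressed_page_size: int  # page size in bytes, including the header
    first_row_index: int       # index of the first row in the page


def plan_reads(pages: List[PageLocation], max_gap: int = 0) -> List[Tuple[int, int]]:
    """Merge page byte ranges into [start, end) reads.

    Two ranges are merged when the gap between them is at most max_gap bytes.
    """
    ranges: List[Tuple[int, int]] = []
    for page in sorted(pages, key=lambda p: p.offset):
        start = page.offset
        end = page.offset + page.compressed_page_size
        if ranges and start - ranges[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


pages = [
    PageLocation(offset=100, compressed_page_size=50, first_row_index=0),
    PageLocation(offset=150, compressed_page_size=50, first_row_index=10),
    PageLocation(offset=1000, compressed_page_size=20, first_row_index=20),
]
reads = plan_reads(pages)
```

With these made-up offsets, the first two pages are contiguous and collapse into one read, while the third stays separate unless `max_gap` is raised to cover the 800-byte gap.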
   
   




> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
