[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760682#comment-17760682
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700294131
> Do you have the time spent on collecting the histograms? And what about
> the average number of records per page and total number of records in the file?
> The reason I ask for this is that number of pages can significantly affect the
> page index size.
@wgtmac I'll have to get back to you on that (the data is on my work
computer 😅). The number of rows per page should be around 20000 (but can be a
little lower due to `max_page_size` constraints), but the records per page can
vary wildly in the nested files. I'll get some exact times tomorrow, but IIRC
for the "flat 1" file, the histogram collection was under 30ms once I figured
out how to do that part in parallel (it had been over 60ms with a serial
implementation).
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)